From: Steve French <smfrench@gmail.com>
To: Enzo Matsumiya <ematsumiya@suse.de>
Cc: "Aurélien Aptel" <aurelien.aptel@gmail.com>,
linux-cifs@vger.kernel.org, pc@cjr.nz, ronniesahlberg@gmail.com,
nspmangalore@gmail.com
Subject: Re: [PATCH] cifs: perf improvement - use faster macros ALIGN() and round_up()
Date: Tue, 13 Sep 2022 21:50:51 -0500 [thread overview]
Message-ID: <CAH2r5msyrbfM71WDpUggD7V_YTZhE50w7y4Umt=QjAG=Yfhz7g@mail.gmail.com> (raw)
In-Reply-To: <20220907204140.77ssfeib3zmwvqy2@cyberdelia>
Most of the changes look OK to me, although there's not much point in changing
lines like:
x = 2 /* sizeof(__u16) */ ....
to
x = sizeof(__u16) ...
I'm not sure those particular changes improve readability enough (and they make
backports harder), and they have zero impact on performance. So even if that
kind of change helps readability a little, it could be split out from the
changes that can actually help performance (and/or clearly improve
readability).
The one part that looked confusing to me is below. Can you explain the reason
for this part of the change?
@@ -2687,20 +2683,17 @@ int smb311_posix_mkdir(const unsigned int xid, struct inode *inode,
 		uni_path_len = (2 * UniStrnlen((wchar_t *)utf16_path, PATH_MAX)) + 2;
 		/* MUST set path len (NameLength) to 0 opening root of share */
 		req->NameLength = cpu_to_le16(uni_path_len - 2);
-		if (uni_path_len % 8 != 0) {
-			copy_size = roundup(uni_path_len, 8);
-			copy_path = kzalloc(copy_size, GFP_KERNEL);
-			if (!copy_path) {
-				rc = -ENOMEM;
-				goto err_free_req;
-			}
-			memcpy((char *)copy_path, (const char *)utf16_path,
-			       uni_path_len);
-			uni_path_len = copy_size;
-			/* free before overwriting resource */
-			kfree(utf16_path);
-			utf16_path = copy_path;
+		copy_size = round_up(uni_path_len, 8);
+		copy_path = kzalloc(copy_size, GFP_KERNEL);
+		if (!copy_path) {
+			rc = -ENOMEM;
+			goto err_free_req;
 		}
+		memcpy((char *)copy_path, (const char *)utf16_path,
+		       uni_path_len);
+		uni_path_len = copy_size;
+		/* free before overwriting resource */
+		kfree(utf16_path);
+		utf16_path = copy_path;
 	}
On Wed, Sep 7, 2022 at 3:41 PM Enzo Matsumiya <ematsumiya@suse.de> wrote:
>
> On 09/07, Aurélien Aptel wrote:
> >Hi,
> >
> >Changes are good from a readability stand point but like the others
> >I'm very skeptical about the perf improvement claims.
>
> Just to be clear: re-reading the commit message, I realize I might have
> sounded a tad sensationalist. The "~50% improvement" came from measuring the
> average latency of a single copy operation. As I replied to Tom Talpey in
> https://lore.kernel.org/linux-cifs/20220906160508.x4ahjzo3tr34w6k5@cyberdelia/
> the actual perceptible gain was only 0.02 to 0.05 seconds in my tests.
>
> Pardon me for any confusion.
>
>
> Enzo
>
> >On Tue, Sep 6, 2022 at 3:34 AM Enzo Matsumiya <ematsumiya@suse.de> wrote:
> >> But also replace (DIV_ROUND_UP(len, 8) * 8) with ALIGN(len, 8), which,
> >> if not optimized by the compiler, has the overhead of a multiplication
> >> and a division. Do the same for roundup(), replacing it with round_up()
> >> (the division-less version, which requires the multiple to be a power of
> >> 2; that is always the case for us).
> >
> >Many of these compile to the same division-less instructions, especially
> >if any of the values are known at compile time.
> >
> >> @@ -2305,7 +2305,7 @@ int CIFSSMBRenameOpenFile(const unsigned int xid, struct cifs_tcon *pTcon,
> >
> >smb1 and computed at compile time
> >
> >> @@ -3350,8 +3350,7 @@ validate_ntransact(char *buf, char **ppparm, char **ppdata,
> >
> >smb1
> >
> >> @@ -2833,9 +2833,12 @@ ip_rfc1001_connect(struct TCP_Server_Info *server)
> >> @@ -2871,8 +2874,12 @@ ip_rfc1001_connect(struct TCP_Server_Info *server)
> >
> >connect time
> >
> >> @@ -1318,7 +1313,7 @@ sess_auth_ntlmv2(struct sess_data *sess_data)
> >> @@ -1442,8 +1437,7 @@ sess_auth_kerberos(struct sess_data *sess_data)
> >> @@ -1494,7 +1488,7 @@ sess_auth_kerberos(struct sess_data *sess_data)
> >> @@ -1546,7 +1540,7 @@ _sess_auth_rawntlmssp_assemble_req(struct sess_data *sess_data)
> >> @@ -1747,7 +1741,7 @@ sess_auth_rawntlmssp_authenticate(struct sess_data *sess_data)
> >
> >connect time
> >
> >> @@ -207,7 +207,7 @@ smb2_compound_op(const unsigned int xid, struct cifs_tcon *tcon,
> >> - size[0] = 1; /* sizeof __u8 See MS-FSCC section 2.4.11 */
> >> + size[0] = sizeof(u8); /* See MS-FSCC section 2.4.11 */
> >> - size[0] = 8; /* sizeof __le64 */
> >> + size[0] = sizeof(__le64);
> >
> >Hot path but no functional change
> >
> >> @@ -248,7 +248,7 @@ smb2_check_message(char *buf, unsigned int len, struct TCP_Server_Info *server)
> >> * Some windows servers (win2016) will pad also the final
> >> * PDU in a compound to 8 bytes.
> >> */
> >> - if (((calc_len + 7) & ~7) == len)
> >> + if (ALIGN(calc_len, 8) == len)
> >
> >Hot path but should compile to the same thing
> >
> >> @@ -466,15 +466,14 @@ build_signing_ctxt(struct smb2_signing_capabilities *pneg_ctxt)
> >> @@ -511,8 +510,7 @@ build_netname_ctxt(struct smb2_netname_neg_context *pneg_ctxt, char *hostname)
> >> @@ -557,18 +555,18 @@ assemble_neg_contexts(struct smb2_negotiate_req *req,
> >> @@ -595,9 +593,7 @@ assemble_neg_contexts(struct smb2_negotiate_req *req,
> >> @@ -780,7 +776,7 @@ static int smb311_decode_neg_context(struct smb2_negotiate_rsp *rsp,
> >
> >connect time
> >
> >> @@ -2413,7 +2409,7 @@ create_sd_buf(umode_t mode, bool set_owner, unsigned int *len)
> >> @@ -2487,7 +2483,7 @@ create_sd_buf(umode_t mode, bool set_owner, unsigned int *len)
> >
> >only relevant if mounted with some acl flags
> >
> >> @@ -2581,7 +2577,7 @@ alloc_path_with_tree_prefix(__le16 **out_path, int *out_size, int *out_len,
> >> - *out_size = roundup(*out_len * sizeof(__le16), 8);
> >> + *out_size = round_up(*out_len * sizeof(__le16), 8);
> >
> >Hot path but from my experiments, round_up() compiles to an *extra*
> >instruction. See below.
> >
> >> @@ -2687,20 +2683,17 @@ int smb311_posix_mkdir(const unsigned int xid, struct inode *inode,
> >
> >Only relevant on posix mounts.
> >
> >> @@ -2826,9 +2819,7 @@ SMB2_open_init(struct cifs_tcon *tcon, struct TCP_Server_Info *server,
> >> - copy_size = uni_path_len;
> >> - if (copy_size % 8 != 0)
> >> - copy_size = roundup(copy_size, 8);
> >> + copy_size = round_up(uni_path_len, 8);
> >> @@ -4090,7 +4081,7 @@ smb2_new_read_req(void **buf, unsigned int *total_len,
> >> - *total_len = DIV_ROUND_UP(*total_len, 8) * 8;
> >> + *total_len = ALIGN(*total_len, 8);
> >
> >These 2 are also hot paths, but skeptical about the optimizations.
> >
> >I've looked at those macros in Compiler Explorer and sure enough they
> >compile to the same thing on x86_64.
> >Even worse, for some of them the "optimized" versions compile to extra
> >instructions:
> >
> >https://godbolt.org/z/z1xhhW9sj
> >
> >size_t mod2(size_t num) {
> > return num%2;
> >}
> >
> >size_t is_aligned2(size_t num) {
> > return IS_ALIGNED(num, 2);
> >}
> >
> >mod2:
> > mov rax, rdi
> > and eax, 1
> > ret
> >
> >is_aligned2:
> > mov rax, rdi
> > not rax <=== extra
> > and eax, 1
> > ret
> >--------------------------
> >
> >size_t align8(size_t num) {
> > return ALIGN(num, 8);
> >}
> >
> >size_t align_andshift8(size_t num) {
> > return ((num + 7) & ~7);
> >}
> >
> >
> >align8:
> > lea rax, [rdi+7]
> > and rax, -8
> > ret
> >align_andshift8:
> > lea rax, [rdi+7]
> > and rax, -8
> > ret
> >
> >same code
> >--------------------------
> >
> >size_t dru8(size_t num) {
> > return DIV_ROUND_UP(num, 8)*8;
> >}
> >
> >size_t rnup8_a(size_t num) {
> > return round_up(num, 8);
> >}
> >
> >size_t rnup8_b(size_t num) {
> > return roundup(num, 8);
> >}
> >
> >dru8:
> > lea rax, [rdi+7]
> > and rax, -8
> > ret
> >rnup8_a:
> > lea rax, [rdi-1]
> > or rax, 7 <==== extra
> > add rax, 1
> > ret
> >rnup8_b:
> > lea rax, [rdi+7]
> > and rax, -8
> > ret
> >
> >round_up has an extra instruction
> >--------------------------
> >
> >I suspect the improvements are more likely to be related to caches,
> >system load, server load, ...
> >You can try to use perf to make a flamegraph and compare.
> >
> >Cheers,
--
Thanks,
Steve
Thread overview: 10+ messages
2022-09-06 1:30 [PATCH] cifs: perf improvement - use faster macros ALIGN() and round_up() Enzo Matsumiya
2022-09-06 2:36 ` ronnie sahlberg
2022-09-06 14:41 ` Enzo Matsumiya
2022-09-06 15:47 ` Tom Talpey
2022-09-06 16:05 ` Enzo Matsumiya
2022-09-07 8:37 ` Aurélien Aptel
2022-09-07 18:15 ` Enzo Matsumiya
2022-09-07 20:41 ` Enzo Matsumiya
2022-09-14 2:50 ` Steve French [this message]
2022-09-14 14:38 ` Enzo Matsumiya