From: Song Liu
Date: Thu, 18 May 2023 13:03:28 -0700
Subject: Re: [RFC PATCH 1/5] mm: intorduce __GFP_UNMAPPED and unmapped_alloc()
To: Kent Overstreet
Cc: Mike Rapoport, linux-mm@kvack.org, Andrew Morton, Dave Hansen, Peter Zijlstra, Rick Edgecombe, Thomas Gleixner, Vlastimil Babka, linux-kernel@vger.kernel.org, x86@kernel.org

On Thu, May 18, 2023 at 12:15 PM Kent Overstreet wrote:
>
> On Thu, May 18, 2023 at 12:03:03PM -0700, Song Liu wrote:
> > On Thu, May 18, 2023 at 11:47 AM Song Liu wrote:
> > >
> > > On Thu, May 18, 2023 at 10:24 AM Kent Overstreet wrote:
> > > >
> > > > On Thu, May 18, 2023 at 10:00:39AM -0700, Song Liu wrote:
> > > > > On Thu, May 18, 2023 at 9:48 AM Kent Overstreet wrote:
> > > > > >
> > > > > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > > > > > > I am working on patches based on the discussion in [1]. I am
> > > > > > > planning to send v1 for review in a week or so.
> > > > > >
> > > > > > Hey Song, I was reviewing that thread too.
> > > > > >
> > > > > > Are you taking a different approach based on Thomas's feedback?
> > > > > > I think he had some fair points in that thread.
> > > > >
> > > > > Yes, the API is based on Thomas's suggestion, like 90% from the
> > > > > discussions.
> > > > >
> > > > > > My own feeling is that the buddy allocator is our tool for
> > > > > > allocating larger variable sized physically contiguous
> > > > > > allocations, so I'd like to see something based on that - I
> > > > > > think we could do a hybrid buddy/slab allocator approach, like
> > > > > > we have for regular memory allocations.
> > > > >
> > > > > I am planning to implement the allocator based on this (reusing
> > > > > the vmap_area logic):
> > > >
> > > > Ah, you're still taking the vmap_area approach.
> > > >
> > > > Mike's approach looks like it'll be _much_ lighter weight and
> > > > higher performance, to me. vmalloc is known to be slow compared to
> > > > the buddy allocator, and with Mike's approach we're only modifying
> > > > mappings once per 2 MB chunk.
> > > >
> > > > I don't see anything in your code for sub-page sized allocations,
> > > > so perhaps I should keep going with my slab allocator.
> > >
> > > The vmap_area approach handles sub-page allocations. In patch 5/5 of
> > > set [2], we showed that multiple BPF programs share the same page
> > > with some kernel text (_etext).
> > >
> > > > Could you share your thoughts on your approach vs. Mike's? I'm
> > > > newer to this area of the code than you two, so maybe there's an
> > > > angle I've missed :)
> > >
> > > AFAICT, a tree-based solution (vmap_area) is more efficient than a
> > > bitmap-based solution.
>
> A tree-based approach requires quite a bit of overhead for the rbtree
> pointers, plus additional vmap_area structs.
>
> With a buddy allocator based approach, there's no additional state that
> needs to be allocated, since it all fits in struct page.
>
> > > First, for a 2MiB page with a 64B chunk size, we need a bitmap of
> > > 2MiB / 64B = 32k bits = 4KiB, while the tree-based solution can
> > > adapt to the number of allocations within this 2MiB page. Also,
> > > searching for a free range in a 4KiB bitmap may actually be slower
> > > than searching in the tree.
> > >
> > > Second, a bitmap-based solution cannot handle >2MiB allocations
> > > cleanly, while a tree-based solution can. For example, if a big
> > > driver uses 3MiB, the tree-based allocator can allocate 4MiB for it
> > > and use the remaining 1MiB for smaller allocations.
>
> We're not talking about a bitmap-based solution for >= PAGE_SIZE
> allocations; the alternative is a buddy allocator - so no searching,
> just per-power-of-two freelists.
>
> >
> > Missed one:
> >
> > Third, a bitmap-based solution requires a "size" parameter in free(),
> > which is an overhead for the user. A tree-based solution doesn't have
> > this issue.
>
> No, we can recover the size of the allocation via compound_order() -
> that hasn't historically been done for alloc_pages() allocations, to
> avoid setting up the state in each page for compound head/tail, but it
> perhaps should be (and is with folios, which we've generally been
> switching to).

If we use compound_order(), we will round every allocation up to a power
of two. Does this mean we will use 4MiB for a 2.1MiB allocation?

Thanks,
Song
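
P.S. For concreteness, a minimal userspace sketch of that rounding
arithmetic (PAGE_SIZE is hardcoded to 4096 here, and order_for() is a
stand-in that mirrors the kernel's get_order(); this is just the math,
not kernel code):

	#include <stdio.h>

	#define PAGE_SIZE 4096UL

	/* Smallest order such that (PAGE_SIZE << order) >= size;
	 * mirrors the kernel's get_order(). */
	static unsigned int order_for(unsigned long size)
	{
		unsigned int order = 0;

		while ((PAGE_SIZE << order) < size)
			order++;
		return order;
	}

	int main(void)
	{
		unsigned long request = 2100UL * 1024;	/* ~2.1MiB */
		unsigned int order = order_for(request);

		/* A pure power-of-two allocator hands back
		 * PAGE_SIZE << order bytes, and compound_order()
		 * can only recover that rounded-up order. */
		printf("requested %lu bytes -> order %u -> %lu bytes\n",
		       request, order, PAGE_SIZE << order);
		return 0;
	}

This prints order 10, i.e. 4MiB for the ~2.1MiB request, unless the
allocator can hand the unused tail back to the freelists.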