From: Liang Li
Date: Wed, 23 Dec 2020 20:11:57 +0800
Subject: Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO
To: David Hildenbrand
Cc: Alexander Duyck, Mel Gorman, Andrew Morton, Andrea Arcangeli,
    Dan Williams, "Michael S. Tsirkin", Jason Wang, Dave Hansen,
    Michal Hocko, Liang Li, linux-mm, LKML,
    virtualization@lists.linux-foundation.org
In-Reply-To: <55052a91-64f9-b343-a1c4-f059ca50ecf3@redhat.com>
References: <20201221162519.GA22504@open-light-1.localdomain>
 <7bf0e895-52d6-9e2d-294b-980c33cf08e4@redhat.com>
 <840ff69d-20d5-970a-1635-298000196f3e@redhat.com>
 <55052a91-64f9-b343-a1c4-f059ca50ecf3@redhat.com>
Tsirkin" , Jason Wang , Dave Hansen , Michal Hocko , Liang Li , linux-mm , LKML , virtualization@lists.linux-foundation.org Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Dec 23, 2020 at 4:41 PM David Hildenbrand wrote: > > [...] > > >> I was rather saying that for security it's of little use IMHO. > >> Application/VM start up time might be improved by using huge pages (and > >> pre-zeroing these). Free page reporting might be improved by using > >> MADV_FREE instead of MADV_DONTNEED in the hypervisor. > >> > >>> this feature, above all of them, which one is likely to become the > >>> most strong one? From the implementation, you will find it is > >>> configurable, users don't want to use it can turn it off. This is not > >>> an option? > >> > >> Well, we have to maintain the feature and sacrifice a page flag. For > >> example, do we expect someone explicitly enabling the feature just to > >> speed up startup time of an app that consumes a lot of memory? I highly > >> doubt it. > > > > In our production environment, there are three main applications have such > > requirement, one is QEMU [creating a VM with SR-IOV passthrough device], > > anther other two are DPDK related applications, DPDK OVS and SPDK vhost, > > for best performance, they populate memory when starting up. For SPDK vhost, > > we make use of the VHOST_USER_GET/SET_INFLIGHT_FD feature for > > vhost 'live' upgrade, which is done by killing the old process and > > starting a new > > one with the new binary. In this case, we want the new process started as quick > > as possible to shorten the service downtime. We really enable this feature > > to speed up startup time for them :) > > Thanks for info on the use case! > > All of these use cases either already use, or could use, huge pages > IMHO. It's not your ordinary proprietary gaming app :) This is where > pre-zeroing of huge pages could already help. You are welcome. For some historical reason, some of our services are not using hugetlbfs, that is why I didn't start with hugetlbfs. > Just wondering, wouldn't it be possible to use tmpfs/hugetlbfs ... > creating a file and pre-zeroing it from another process, or am I missing > something important? At least for QEMU this should work AFAIK, where you > can just pass the file to be use using memory-backend-file. > If using another process to create a file, we can offload the overhead to another process, and there is no need to pre-zeroing it's content, just populating the memory is enough. If we do it that way, then how to determine the size of the file? it depends on the RAM size of the VM the customer buys. Maybe we can create a file large enough in advance and truncate it to the right size just before the VM is created. Then, how many large files should be created on a host? You will find there are a lot of things that have to be handled properly. I think it's possible to make it work well, but we will transfer the management complexity to up layer components. It's a bad practice to let upper layer components process such low level details which should be handled in the OS layer. > > > >> I'd love to hear opinions of other people. (a lot of people are offline > >> until beginning of January, including, well, actually me :) ) > > > > OK. I will wait some time for others' feedback. Happy holidays! > > To you too, cheers! > I have to work at least two months before the vacation. 
> >
> >> I'd love to hear opinions of other people. (a lot of people are offline
> >> until beginning of January, including, well, actually me :) )
> >
> > OK. I will wait some time for others' feedback. Happy holidays!
>
> To you too, cheers!
>

I have to work at least two more months before my vacation. :(

Liang