From: Dan van der Ster
Date: Fri, 15 Oct 2021 12:05:51 +0200
Subject: Re: CephFS optimized for machine learning workload
To: "Yan, Zheng"
Cc: ceph-devel <ceph-devel@vger.kernel.org>

Hi Zheng,

Thanks for this really nice set of PRs -- we will try them at our site
in the coming weeks and come back with practical feedback.

A few questions:

1. How many clients did you scale to with the improvements in the 2nd PR?
2. Do these PRs improve the process of scaling the number of active MDS
daemons up or down?

Thanks!

Dan

On Wed, Sep 15, 2021 at 9:21 AM Yan, Zheng wrote:
>
> The following PRs are optimizations we (Kuaishou) made for machine
> learning workloads (randomly reading billions of small files):
>
> [1] https://github.com/ceph/ceph/pull/39315
> [2] https://github.com/ceph/ceph/pull/43126
> [3] https://github.com/ceph/ceph/pull/43125
>
> The first PR adds an option that disables dirfrag prefetch. When files
> are accessed randomly, dirfrag prefetch adds lots of useless files to
> the cache and causes cache thrashing; MDS performance can drop below
> 100 RPS. With dirfrag prefetch disabled, the MDS sends a getomapval
> request to RADOS for each cache-missed lookup. A single MDS can handle
> about 6k cache-missed lookup requests per second (all-SSD metadata
> pool).
>
> The second PR optimizes MDS performance for a large number of clients
> and a large number of read-only opened files. It can also greatly
> reduce MDS recovery time for read-mostly workloads.
>
> The third PR makes the MDS cluster distribute all dirfrags randomly.
> The MDS uses consistent hashing to calculate the target rank for each
> dirfrag. Compared to the dynamic balancer and subtree pinning, this
> distributes metadata among MDSs more evenly. Besides, the MDS only
> migrates single dirfrags (instead of big subtrees) for load balancing,
> so it pauses for a shorter time during metadata migration. The
> drawbacks of this change are that stat(2) on a directory can be slow
> and rename(2) of a file into a different directory can be slow: with
> random dirfrag distribution, these operations likely involve multiple
> MDSs.
>
> The above three PRs are all merged into an integration branch:
> https://github.com/ukernel/ceph/tree/wip-mds-integration
>
> We (Kuaishou) have run this code for months, with a 16-active-MDS
> cluster serving billions of small files. In a random file read test, a
> single MDS can handle about 6k ops, and performance increases linearly
> with the number of active MDSs. In a file creation test (mpirun -np 160
> -host xxx:160 mdtest -F -L -w 4096 -z 2 -b 10 -I 200 -u -d ...), 16
> active MDSs can serve over 100k file creations per second.
>
> Yan, Zheng
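
The cache-miss path described in [1] can be exercised by hand with the
stock rados CLI: each dirfrag is a RADOS object in the metadata pool,
and each head dentry is an omap key on that object. A minimal sketch,
assuming a metadata pool named cephfs_metadata, a directory inode
0x10000000000 with a single frag, and a file named myfile (the pool,
object, and key names here are hypothetical examples):

  # Fetch one dentry from the dirfrag object's omap -- roughly the
  # single-key lookup the MDS issues on a cache miss when prefetch is
  # disabled.
  rados -p cephfs_metadata getomapval 10000000000.00000000 myfile_head

  # List every dentry in the dirfrag -- the amount of metadata a full
  # prefetch would pull into cache.
  rados -p cephfs_metadata listomapkeys 10000000000.00000000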
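
The consistent-hash placement in [3] can be sketched as a hash ring
with virtual nodes. The following C++ is a generic illustration of the
technique, not the code from the PR: it maps a dirfrag (inode number
plus frag id) to a target MDS rank, and all names in it are invented
for the example.

  // consistent_hash_sketch.cc -- illustration only.
  #include <cstdint>
  #include <functional>
  #include <map>
  #include <string>

  class DirfragRing {
    std::map<uint64_t, int> ring_;      // hash point -> MDS rank
    static constexpr int kVnodes = 64;  // virtual nodes per rank

    static uint64_t hash(const std::string &s) {
      return std::hash<std::string>{}(s);
    }

  public:
    explicit DirfragRing(int num_ranks) {
      // Place kVnodes points on the ring for each rank so load spreads
      // evenly.
      for (int rank = 0; rank < num_ranks; ++rank)
        for (int v = 0; v < kVnodes; ++v)
          ring_[hash("mds." + std::to_string(rank) + "#" +
                     std::to_string(v))] = rank;
    }

    // Target rank = first ring point at or after the dirfrag's hash,
    // wrapping around to the start of the ring.
    int target_rank(uint64_t ino, uint32_t frag) const {
      uint64_t h = hash(std::to_string(ino) + "." + std::to_string(frag));
      auto it = ring_.lower_bound(h);
      return (it == ring_.end() ? ring_.begin() : it)->second;
    }
  };

Because each dirfrag hashes onto the ring independently, adding or
removing a rank only remaps the dirfrags whose hash points fall on the
affected arcs -- consistent with the PR's behavior of migrating single
dirfrags rather than whole subtrees.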