From: Yakov Lerner <iler.ml@gmail.com>
To: Stephen Hemminger <shemminger@vyatta.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>,
	netdev@vger.kernel.org, davem@davemloft.net
Subject: Re: [PATCH] /proc/net/tcp, overhead removed
Date: Tue, 29 Sep 2009 20:34:51 +0300
Message-ID: <f36b08ee0909291034n592664b4r9eab63630173493b@mail.gmail.com>
In-Reply-To: <20090929084534.41274f66@nehalam>

On Tue, Sep 29, 2009 at 18:45, Stephen Hemminger <shemminger@vyatta.com> wrote:
> On Tue, 29 Sep 2009 11:55:18 +0300
> Yakov Lerner <iler.ml@gmail.com> wrote:
>
>> On Tue, Sep 29, 2009 at 10:56, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >
>> > Yakov Lerner wrote:
>> > > Take 2.
>> > >
>> > > "Sharp improvement in performance of /proc/net/tcp when number of
>> > > sockets is large and hashsize is large.
>> > > O(numsock * hashsize) time becomes O(numsock + hashsize). On slow
>> > > processors, speed difference can be x100 and more."
>> > >
>> > > I must say that I'm not fully satisfied with my choice of "st->sbucket"
>> > > for the new preserved index. A better name would be "st->snum".
>> > > Re-using "st->sbucket" saves 4 bytes and keeps the patch to one source file.
>> > > But "st->sbucket" has a different meaning in the OPENREQ and LISTEN states;
>> > > this can be confusing.
>> > > Maybe it would be better to add an "snum" member to struct tcp_iter_state?
>> > >
>> > > Shall I change the subject when sending "take N+1", or keep the old subject?
>> > >
>> > > Signed-off-by: Yakov Lerner <iler.ml@gmail.com>
>> > > ---
>> > >  net/ipv4/tcp_ipv4.c |   35 +++++++++++++++++++++++++++++++++--
>> > >  1 files changed, 33 insertions(+), 2 deletions(-)
>> > >
>> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>> > > index 7cda24b..e4c4f19 100644
>> > > --- a/net/ipv4/tcp_ipv4.c
>> > > +++ b/net/ipv4/tcp_ipv4.c
>> > > @@ -1994,13 +1994,14 @@ static inline int empty_bucket(struct tcp_iter_state *st)
>> > >               hlist_nulls_empty(&tcp_hashinfo.ehash[st->bucket].twchain);
>> > >  }
>> > >
>> > > -static void *established_get_first(struct seq_file *seq)
>> > > +static void *established_get_first_after(struct seq_file *seq, int bucket)
>> > >  {
>> > >       struct tcp_iter_state *st = seq->private;
>> > >       struct net *net = seq_file_net(seq);
>> > >       void *rc = NULL;
>> > >
>> > > -     for (st->bucket = 0; st->bucket < tcp_hashinfo.ehash_size; ++st->bucket) {
>> > > +     for (st->bucket = bucket; st->bucket < tcp_hashinfo.ehash_size;
>> > > +          ++st->bucket) {
>> > >               struct sock *sk;
>> > >               struct hlist_nulls_node *node;
>> > >               struct inet_timewait_sock *tw;
>> > > @@ -2010,6 +2011,8 @@ static void *established_get_first(struct seq_file *seq)
>> > >               if (empty_bucket(st))
>> > >                       continue;
>> > >
>> > > +             st->sbucket = st->num;
>> > > +
>> > >               spin_lock_bh(lock);
>> > >               sk_nulls_for_each(sk, node, &tcp_hashinfo.ehash[st->bucket].chain) {
>> > >                       if (sk->sk_family != st->family ||
>> > > @@ -2036,6 +2039,11 @@ out:
>> > >       return rc;
>> > >  }
>> > >
>> > > +static void *established_get_first(struct seq_file *seq)
>> > > +{
>> > > +     return established_get_first_after(seq, 0);
>> > > +}
>> > > +
>> > >  static void *established_get_next(struct seq_file *seq, void *cur)
>> > >  {
>> > >       struct sock *sk = cur;
>> > > @@ -2064,6 +2072,9 @@ get_tw:
>> > >               while (++st->bucket < tcp_hashinfo.ehash_size &&
>> > >                               empty_bucket(st))
>> > >                       ;
>> > > +
>> > > +             st->sbucket = st->num;
>> > > +
>> > >               if (st->bucket >= tcp_hashinfo.ehash_size)
>> > >                       return NULL;
>> > >
>> > > @@ -2107,6 +2118,7 @@ static void *tcp_get_idx(struct seq_file *seq, loff_t pos)
>> > >
>> > >       if (!rc) {
>> > >               st->state = TCP_SEQ_STATE_ESTABLISHED;
>> > > +             st->sbucket = 0;
>> > >               rc        = established_get_idx(seq, pos);
>> > >       }
>> > >
>> > > @@ -2116,6 +2128,25 @@ static void *tcp_get_idx(struct seq_file *seq, loff_t pos)
>> > >  static void *tcp_seq_start(struct seq_file *seq, loff_t *pos)
>> > >  {
>> > >       struct tcp_iter_state *st = seq->private;
>> > > +
>> > > +     if (*pos && *pos >= st->sbucket &&
>> > > +         (st->state == TCP_SEQ_STATE_ESTABLISHED ||
>> > > +          st->state == TCP_SEQ_STATE_TIME_WAIT)) {
>> > > +             void *cur;
>> > > +             int nskip;
>> > > +
>> > > +             /* for states estab and tw, st->sbucket is index (*pos) */
>> > > +             /* corresponding to the beginning of bucket st->bucket */
>> > > +
>> > > +             st->num = st->sbucket;
>> > > +             /* jump to st->bucket, then skip (*pos - st->sbucket) items */
>> > > +             st->state = TCP_SEQ_STATE_ESTABLISHED;
>> > > +             cur = established_get_first_after(seq, st->bucket);
>> > > +             for (nskip = *pos - st->num; cur && nskip > 0; --nskip)
>> > > +                     cur = established_get_next(seq, cur);
>> > > +             return cur;
>> > > +     }
>> > > +
>> > >       st->state = TCP_SEQ_STATE_LISTENING;
>> > >       st->num = 0;
>> > >       return *pos ? tcp_get_idx(seq, *pos - 1) : SEQ_START_TOKEN;
>> >
>> > Just in case you are working on "take 3" of the patch, there is a fundamental problem.
>> >
>> > All the scalability problems come from the fact that tcp_seq_start()
>> > *has* to rescan all the tables from the beginning, because of the lseek() capability
>> > on the /proc/net/tcp file.
>> >
>> > We could probably disable llseek() (for positions other than the start of the file),
>> > and rely only on internal state (listening/established hashtable, hash bucket, position in chain).
>> >
>> > I cannot imagine how an application could rely on lseek() to a position > 0 in this file.
>>
>>
>> I thought /proc/net/tcp could be both fast and allow lseek:
>> (1) when no lseek was issued since the last read
>> (we can detect this), /proc/net/tcp can jump to the
>> last known bucket (the common case), vs.
>> (2) switch to slow mode (scan from the beginning of the hash)
>> when lseek was used, no?
>
> If you look at fib_hash and fib_trie, they already do the same thing.
>  * fib_hash records last hash chain to avoid overhead of rescan.
>  * fib_trie records last route and does fast lookup to restart from there.

Thanks for the pointer.
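
For readers following the thread, here is a rough userspace sketch of the
idea discussed above: remember which bucket the last read stopped in and the
position index at which that bucket starts, then resume from there instead of
rescanning from bucket 0. It is not the kernel code. The names struct table,
struct iter_state, fill_table(), seek_slow() and seek_fast() are all
hypothetical, and the sketch assumes the table does not change between calls,
which is not true of the real TCP hash tables.

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 8 /* hypothetical table size, for illustration only */

struct node { int val; struct node *next; };
struct table { struct node *bucket[NBUCKETS]; };

/* Iterator state, loosely analogous to st->bucket / st->sbucket in the patch. */
struct iter_state {
	int bucket;      /* bucket of the last entry returned, -1 if none yet */
	long bucket_pos; /* position index of the first entry in that bucket */
	long steps;      /* entries visited so far, kept only to compare costs */
};

/* Build a demo table from a caller-supplied pool of n nodes. */
static void fill_table(struct table *t, struct node *pool, int n)
{
	assert(t && pool && n >= 0);
	for (int b = 0; b < NBUCKETS; b++)
		t->bucket[b] = NULL;
	for (int i = 0; i < n; i++) {
		int b = i % NBUCKETS;
		pool[i].val = i;
		pool[i].next = t->bucket[b];
		t->bucket[b] = &pool[i];
	}
}

/* Slow mode: count entries from bucket 0 until position pos is reached. */
static struct node *seek_slow(const struct table *t, struct iter_state *st,
			      long pos)
{
	long idx = 0;

	for (int b = 0; b < NBUCKETS; b++) {
		long start = idx; /* position of this bucket's first entry */
		for (struct node *n = t->bucket[b]; n; n = n->next, idx++) {
			st->steps++;
			if (idx == pos) {
				st->bucket = b;
				st->bucket_pos = start;
				return n;
			}
		}
	}
	return NULL;
}

/*
 * Fast mode: if pos is at or past the start of the remembered bucket,
 * resume counting from that bucket; otherwise fall back to the full scan.
 */
static struct node *seek_fast(const struct table *t, struct iter_state *st,
			      long pos)
{
	if (st->bucket < 0 || pos < st->bucket_pos)
		return seek_slow(t, st, pos);

	long idx = st->bucket_pos;

	for (int b = st->bucket; b < NBUCKETS; b++) {
		long start = idx;
		for (struct node *n = t->bucket[b]; n; n = n->next, idx++) {
			st->steps++;
			if (idx == pos) {
				st->bucket = b;
				st->bucket_pos = start;
				return n;
			}
		}
	}
	return NULL;
}
```

Reading positions 0, 1, 2, ... in order through seek_fast() visits each entry
only a few times in total, while calling seek_slow() for every position
revisits all earlier entries each time. A seek backwards past the remembered
bucket simply takes the slow path, which corresponds to case (2) above.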

