From: Bob Liu <bob.liu@oracle.com>
To: Jan Beulich <JBeulich@suse.com>
Cc: Bob Liu <lliubbo@gmail.com>,
	keir@xen.org, ian.campbell@citrix.com,
	George.Dunlap@eu.citrix.com, andrew.cooper3@citrix.com,
	xen-devel@lists.xenproject.org
Subject: Re: [PATCH 4/4] xen: use idle vcpus to scrub pages
Date: Wed, 18 Jun 2014 09:18:06 +0800	[thread overview]
Message-ID: <53A0E8CE.7050304@oracle.com> (raw)
In-Reply-To: <53A0583B020000780001B0E0@mail.emea.novell.com>


On 06/17/2014 09:01 PM, Jan Beulich wrote:
>>>> On 17.06.14 at 13:49, <lliubbo@gmail.com> wrote:
>> In case of heavy lock contention, use a percpu list.
>>  - Delist a batch of pages to a percpu list from "scrub" free page list.
>>  - Scrub pages on this percpu list.
>>  - Add those clean pages to normal "head" free page list, merge with other
> 
> Did you mean "heap" here?
> 

Yes, sorry!

>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -587,12 +587,14 @@ int domain_kill(struct domain *d)
>>          d->tmem_client = NULL;
>>          /* fallthrough */
>>      case DOMDYING_dying:
>> +        enable_idle_scrub = 0;
>>          rc = domain_relinquish_resources(d);
>>          if ( rc != 0 )
>>          {
>>              BUG_ON(rc != -EAGAIN);
>>              break;
>>          }
>> +        enable_idle_scrub = 1;
>>          for_each_vcpu ( d, v )
>>              unmap_vcpu_info(v);
>>          d->is_dying = DOMDYING_dead;
> 
> ???

enable_idle_scrub is a rough way to disable idle vcpu scrubbing, so that
domain_relinquish_resources() can take the heap_lock sooner and "xl
destroy" returns quickly.
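
Purely as an illustration of a less rough variant (not in the patch; the
names below are made up): a count of domains currently relinquishing
resources instead of a plain 0/1 toggle, so two domains dying at the same
time don't re-enable scrubbing under each other:

/* Illustrative sketch only: count domains being torn down instead of
 * toggling a single flag. -EAGAIN/retry handling omitted for brevity. */
static atomic_t scrub_inhibit = ATOMIC_INIT(0);

/* In domain_kill(), around domain_relinquish_resources(): */
atomic_inc(&scrub_inhibit);
rc = domain_relinquish_resources(d);
atomic_dec(&scrub_inhibit);

/* In scrub_free_pages(): back off while any domain is being torn down. */
if ( atomic_read(&scrub_inhibit) )
    return;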

> 
>> --- a/xen/common/page_alloc.c
>> +++ b/xen/common/page_alloc.c
>> @@ -79,6 +79,9 @@ PAGE_LIST_HEAD(page_offlined_list);
>>  /* Broken page list, protected by heap_lock. */
>>  PAGE_LIST_HEAD(page_broken_list);
>>  
>> +volatile bool_t enable_idle_scrub;
>> +DEFINE_PER_CPU(struct page_list_head, scrub_list_cpu);
> 
> I'm pretty sure I pointed out to you before that variables used only in
> a single file should be static.
> 

Sorry, again..

>> @@ -1387,7 +1390,86 @@ void __init scrub_heap_pages(void)
>>      setup_low_mem_virq();
>>  }
>>  
>> +#define SCRUB_BATCH 1024
>> +void scrub_free_pages(void)
>> +{
>> +    struct page_info *pg;
>> +    unsigned int i, j, node_empty = 0, nr_delisted = 0;
>> +    int order;
>> +    unsigned int cpu = smp_processor_id();
>> +    unsigned int node = cpu_to_node(cpu);
>> +    struct page_list_head *temp_list = &this_cpu(scrub_list_cpu);
>> +
>> +    if ( !enable_idle_scrub )
>> +        return;
>> +
>> +    do
>> +    {
>> +        if ( page_list_empty(temp_list) )
>> +        {
>> +            /* Delist a batch of pages from global scrub list */
>> +            spin_lock(&heap_lock);
>> +            for ( j = 0; j < NR_ZONES; j++ )
>> +            {
>> +                for ( order = MAX_ORDER; order >= 0; order-- )
>> +                {
>> +                    if ( (pg = page_list_remove_head(&scrub(node, j, order))) )
>> +                    {
>> +                        for ( i = 0; i < (1 << order); i++)
>> +                            mark_page_offline(&pg[i], 0);
> 
> A page previously having caused #MC now suddenly became healthy
> again?
> 

Will be fixed.

>> +
>> +                        page_list_add_tail(pg, temp_list);
>> +                        nr_delisted += (1 << order);
>> +                        if ( nr_delisted > SCRUB_BATCH )
>> +                        {
>> +                            nr_delisted = 0;
>> +                            spin_unlock(&heap_lock);
>> +                            goto start_scrub;
>> +                        }
>> +                    }
>> +                }
>> +            }
>> +
>> +            node_empty = 1;
>> +            spin_unlock(&heap_lock);
>> +        }
>> +        else
>> +        {
>> +start_scrub:
> 
> Labels need to be indented by at least one space.
> 
>> +            /* Scrub percpu list */
>> +            while ( !page_list_empty(temp_list) )
>> +            {
>> +                pg = page_list_remove_head(temp_list);
>> +                ASSERT(pg);
>> +                order = PFN_ORDER(pg);
>> +                for ( i = 0; i < (1 << order); i++ )
>> +                {
> 
> Order here can again be almost arbitrarily high. You aren't allowed
> to scrub e.g. 4Tb in one go.
> 

I can add a softirq_pending(cpu) check after each scrub_one_page() call,
roughly like the sketch below.
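
(Sketch only, restructuring the loop from the hunk above; what to do with
the partially scrubbed remainder, e.g. putting it back on the percpu list,
is left out.)

order = PFN_ORDER(pg);
for ( i = 0; i < (1 << order); i++ )
{
    ASSERT( test_bit(_PGC_need_scrub, &(pg[i].count_info)) );
    scrub_one_page(&pg[i]);
    pg[i].count_info &= ~(PGC_need_scrub);

    /* Don't scrub e.g. 4Tb in one go: yield as soon as work is pending. */
    if ( softirq_pending(cpu) )
        break;
}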

>> +                    ASSERT( test_bit(_PGC_need_scrub, &(pg[i].count_info)) );
>> +                    scrub_one_page(&pg[i]);
>> +                    pg[i].count_info &= ~(PGC_need_scrub);
>> +                }
>>  
>> +                /* Add pages to free heap list */
>> +                spin_lock(&heap_lock);
> 
> Between the dropping of the lock above and the re-acquiring of it
> here the pages "stolen" can lead to allocation failures despite there
> being available memory. This needs to be avoided if at all possible.
> 

#define SCRUB_BATCH 1024 is used to limit the number of pages on the
percpu list.
But yes, in the worst case (order = 20) 4Tb would be invisible to the
allocator. I don't have a good solution right now.

>> +                for ( i = 0; i < (1 << order); i++ )
>> +                {
>> +                    ASSERT ( !test_bit(_PGC_need_scrub, &(pg[i].count_info)) );
>> +                    pg[i].count_info |= PGC_state_free;
>> +                }
>> +                ASSERT (node == phys_to_nid(page_to_maddr(pg)));
>> +                merge_free_trunks(pg, order, 0);
>> +                spin_unlock(&heap_lock);
>> +
>> +                if ( softirq_pending(cpu) )
>> +                    return;
>> +            }
>> +        }
>> +
>> +        /* Scrub list of this node is empty */
>> +        if ( node_empty )
>> +            return;
>> +    } while ( !softirq_pending(cpu) );
>> +}
> 
> And all the logic needs to be made NUMA-aware, i.e. a CPU should
> prefer to scrub local memory, and only resort to scrubbing distant
> memory if no CPU on the correct node is idle. Plus (as we have seen
> with Konrad's boot time scrubbing changes) you should avoid having
> two hyperthreads within the same core doing scrubbing at the same
> time.
> 

Okay, will be taken into account.
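
For the hyperthread part, a rough sketch of what I have in mind (assuming
the per-CPU cpu_sibling_mask can be consulted from here):

/* Sketch only: let just one thread per core do idle-time scrubbing,
 * similar to the boot time scrubbing changes. */
if ( cpumask_first(per_cpu(cpu_sibling_mask, cpu)) != cpu )
    return;    /* another sibling of this core does the scrubbing */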

> In the end I'm getting the impression that this all wasn't properly
> thought through yet. Convoys on the lock inside the idle processing
> should e.g. also be avoided (or reduced to a minimum) - there's no
> point preventing the entering of C states if all you're otherwise going
> to do is spin on a lock.
> 

I already use a percpu list to reduce spinning on the heap lock.
node_empty could be extended to avoid the case you mentioned, for
example along the lines of this rough sketch:
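
(Names below are made up for illustration; error paths omitted.)

/* A per-node "scrubbing in progress" bit so at most one idle CPU at a
 * time goes for the heap_lock on behalf of a given node. */
static DECLARE_BITMAP(node_scrubbing, MAX_NUMNODES);

if ( test_and_set_bit(node, node_scrubbing) )
    return;    /* another idle CPU is already scrubbing this node */
/* ... delist a batch, scrub it, return the pages to the heap ... */
clear_bit(node, node_scrubbing);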
Thank you very much for all your review comments.

-Bob

