20050614 Tuesday June 14, 2005

Prefetching for Fun and Profit...

Prefetching for Fun and Profit...

Prefetching for Fun and Profit...


Now that OpenSolaris has been fully released into the wild, I thought I'd share a story of pain, misery and much satisfaction. (To the tune Brady Bunch) This is the story, of a lovely page_t... that refused to be inited fast and true (pa, da, da, ra, da). Okay... so it's painfully obvious that I'm not much for singing; but, since you no doubt already read that Solaris 10 boots much faster than previous releases of Solaris, you may want to stay tuned and find out about one of the contributors.

As Enterprise Class Systems have grown in both cpu and memory capacity, an ever increasing demand on system availability has also been imposed on such systems. As memory capacity grew, the Solaris kernel exhibited scalability and performance bottlenecks due to either out of date algorithms, unscalable kernel data structures, or both. One of the areas that exhibited an obvious need for improvement was the initialization of Virtual Memory (vm) structures needed to manage the physical memory in the system (aka. page_t's). As described in Sun bug 4765332, the main reason the kernel takes a disproportionate amount of time during vm structure initialization is the fact that at the beginning of the kernel's life, only one cpu is doing any work, regardless of how many cpu's are configured in the system domain. The initialization of the data structures is done by the startup() routine, which is the very first function called from the kernel's main().

From the code sniplet below, we can see that the initialization of the kernel vm data structures occurs way before the rest of the other system cpu's are brought online. Even when the other cpu's have been initialized, the scheduler doesn't kick off until sched().

void
main(void)
{
        proc_t          *p = ttoproc(curthread);        /* &p0 */
        int             (**initptr)();
        extern void     sched();

		/* ... */

        startup();

		/* ... */

        /*
         * Perform MP initialization, if any.
         */
        start_other_cpus(0);

		/* ... */

	sched();
}

History and Motivation...

Empirical data gathered on a StarFire 15k w/100GB of physical memory utilizing UltraSparc III+ (Cheetah+) processors with a clock speed of 900MHz, revealed that it took roughly 156 seconds to run thru the code sniplet above; that is main() -> sched(). Out of that time, about 15% of the time was consumed by the initialization of page_t's. If we start playing w/the numbers, we readily see that it takes us around 23.4 seconds to initialize all the page structures needed to manage the 100GB's. Let's go further and find out how much time we spend initializing each page_t...

  Time = 156 * 0.15 = 23.4 Secs

Memory = 100 * 1024 * 1024 * 1024 / 8k

       = 13107200 8k pages

   R        23.4 S
  ---  =  ------------
  1pg     13107200 pgs

    R  = 1.785 uS/pg

So, we know that we are initializing pages at a rate of 1.785 microseconds per page. However, if we extrapolate this data to a fully populated StarFire 15k w/576GB of memory (assuming 1GB DIMMs), we start to see how quickly the numbers add up; For a 576GB StarFire 15k, there are roughly 75497472 8k pages, give or take since OpenBoot (OBP) needs some of these for itself to do its own accounting. If we use the above data...

Memory = 576 * 1024 * 1024 * 1024 / 8k

       = 75497472 8k pages

   R   = 1.785 uS/pg

  Time = 1.785 uS/pg * 75497472 pgs = 134.8 Secs

/* That's more than 2 mins on this step alone ! */

For systems that require 4 9's availability, the allowed down time is (365 days/year * 24 hours/day * 60 mins/hour * 0.0001) or around 52.5 mins/year. If we account for other tasks such as...

  1. Collecting a system dump
  2. Additional steps in the boot process
    • Power On Self Test (POST)
    • OpenBoot Initialization (OBP)
    • I/O Device polling
    • Mission critical application startup (ie. Oracle)
  3. Additional steps during system shutdown
    • Flushing dirty pages when unmounting filesystems
    • Running thru memory clearing pages Unrecoverable Errors
  4. Any scheduled downtime
... spending 2 mins on just the page_t initialization, can consume a substantial amount of valuable down time; 3.84% to be exact for the above example, on each reboot.

"So, why is this Rick guy spending all this time boring us w/statistics and these so-called page_t's ?" you may be asking yourself. Well, it's simple really; the point is that by exploiting the hardware's prefetch capability and doing smarter initialization of the page structures, we get a big performance boost. Bug 4765332 details how the problem was addressed in a 3-Tier approach:

  • Realizing that the page_t initialization code is only ever executed when only one cpu is active, we remove any multi-threaded synchronization in these code paths to eliminate the lock overhead.
  • As part of the initialization, 8k pages are coalesced into larger (synthetic) pages, whose size is the largest mmu supported page size, by recognizing the alignment of the page being initialized.
  • HW prefetch instructions can be leveraged as the memory structures being initialized are contiguous in physical memory.

For the love of C, please get to the code...

We begin our travel thru the code by introducing three key data structures; the memlist structure, the memseg structure, and the much advertised page_t structure.

  • The memlist structure is used to describe contiguous ranges of configured physical memory in the system domain. Since domains can be sparsely built, that is with non-contiguous system boards, there can be gaps in the physical memory range. There are three key memlist's maintained by OBP and the Solaris kernel: phys_install, phys_avail and virt_avail.
  • The memseg structure is used to describe the number of pages contained within a contiguous chunk of memory, and their assigned page frame numbers.
  • The page_t structure is used by the vm subsystem to describe and manage an 8k region of physical memory in the system.

Now that we know the key players, we delve into the flow of how the system gets to the point where it initializes the page_t's. We begin by taking a 10k foot level view of the startup code, and then hone in on the key routines.

_mlsetup()
    main()
        startup()
	    startup_memlist()
	        kphysm_init()		<--- page_t initialization
	            add_physmem()
			.
			:

kphysm_init() is responsible for taking the system's phys_avail memlist, which is constructed by OBP and passed up to the Solaris kernel by copy_boot_memlists(), and creating the system's memsegs list. Let's take a look...

static void
kphysm_init(page_t *pp, struct memseg *memsegp, pgcnt_t npages,
        uintptr_t kpm_pp, pgcnt_t kpm_npages)
{
        struct memlist  *pmem;
        struct memseg   *msp;
        pfn_t            base;
        pgcnt_t          num;
        pfn_t            lastseg_pages_end = 0;
        pgcnt_t          nelem_used = 0;

        ASSERT(page_hash != NULL && page_hashsz != 0);

        msp = memsegp;
        for (pmem = phys_avail; pmem && npages; pmem = pmem->next) {

                /*
                 * Build the memsegs entry
                 */
                num = btop(pmem->size);
                if (num > npages)
                        num = npages;
                npages -= num;
                base = btop(pmem->address);

                msp->pages = pp;
                msp->epages = pp + num;
                msp->pages_base = base;
                msp->pages_end = base + num;

		/* ... */

                memseg_list_add(msp);
                                        
                /*                              
                 * add_physmem() initializes the PSM part of the page
                 * struct by calling the PSM back with add_physmem_cb().
                 * In addition it coalesces pages into larger pages as
                 * it initializes them.                 
                 */                     
                add_physmem(pp, num, base);     
                pp += num;                          
                msp++;                          
        }                                               
        build_pfn_hash();               
}

As it creates each memseg, it hangs it on the system's memsegs linked list, by calling memseg_list_add(), and it passes the memseg's data (ie. starting page, number of pages in the memseg, and starting page frame number) to add_physmem(), which is itself responsible for the initialization of each 8k page_t and construction of larger pages by coalescing the properly aligned constituent pages. Let's continue by delving into add_physmem().

void
add_physmem(
        page_t  *pp,
        pgcnt_t num,
        pfn_t   pnum)
{
        page_t  *root = NULL;
        uint_t  szc = page_num_pagesizes() - 1;			<--- 1.
        pgcnt_t large = page_get_pagecnt(szc);			<--- 2.
        pgcnt_t cnt = 0;

	/* ... */

        for (; num; pp = page_next_raw(pp), pnum++, num--) {

                add_physmem_cb(pp, pnum);			<--- 3.

		/* ... */

                if (szc == 0) {					<--- 4.
                        pp->p_szc = 0;
                        page_free_at_startup(pp);		<--- 5.
                        continue;
                }

                if (!IS_P2ALIGNED(pnum, large)) {

                        if (root == NULL) {			<--- 6.
                                pp->p_szc = 0;
                                page_free_at_startup(pp);
                                continue;
                        }

                        pp->p_szc = szc;			<--- 7.
                        page_list_concat(&root, &pp);

                        if (++cnt == large) {			<--- 8.
                                page_free_large_ctr(cnt);
                                page_list_add_pages(root, PG_LIST_ISINIT);
                                root = NULL;
                                cnt = 0;
                        }
                        continue;
                }

		/* ... */

                if (num < large) {				<--- 9.
                        pp->p_szc = 0;
                        page_free_at_startup(pp);
                        continue;
                }

                pp->p_szc = szc;				<--- 10.
                cnt++;
                root = pp;
        }
        /* ... */
}

Okay, so let's explain the madness above =:^)...

  1. Get the largest mmu supported page size.
  2. Use the size code to get the number of 8k pages needed to build a page of that size.
  3. Since we know we're accessing a contiguous range of memory, prefetch a page_t.
    /* more on this when we discuss add_physmem_cb() */
  4. If the system does not support large pages...
  5. ... hang the newly initialized page_t in the freelist without the use of synchronization primitives.
  6. If not currently building a large page, then free the page as an 8k page, just as in 5. above.
  7. If root != NULL, then we are building a large page, so use page_list_concat() to link it in as a constituent page.
  8. If we have enough 8k constituent pages to build the large page, hang the large page in the freelist and we start over.
  9. If the pages remaining in the memseg are not enough to build a new large page, then hang in the freelist as an 8k page.
  10. Otherwise, the memseg should have enough pages to build a new large page, so initialize the root page and keep going.

Prefetching; where's the prefetching ???

UltraSparc II and newer processors have some sort of prefetching capability. The same is true of Pentium 4 processors with Streaming SIMD Extensions 2 (SSE2). We focus our discussion on the SPARC platform, since that's where the improvements of bug 4765332 were realized; but, it should be noteworthy that these same enhancements can be made to the x86 version of add_physmem_cb().

As seen in step 3. of add_physmem(), add_physmem_cb() is responsible for prefetching a page_t for future access. add_physmem_cb() is really nothing more than a fancy wrapper function to prefetch_page_w(), which, to no surprise, is an assembly level routine. Here's the code...

#if defined(CHEETAH) || defined(CHEETAH_PLUS) || defined(JALAPENO)

	! UltraSparc III/III+

#define STRIDE1 512
#define STRIDE2 576

        ENTRY(prefetch_page_w)
        prefetch        [%o0+STRIDE1], #n_writes
        retl
        prefetch        [%o0+STRIDE2], #n_writes
        SET_SIZE(prefetch_page_w)

#elif defined(SPITFIRE) || defined(HUMMINGBIRD)

	! UltraSparc II/II+

#define STRIDE1 128
#define STRIDE2 192

        ENTRY(prefetch_page_w)
        prefetch        [%o0+STRIDE1], #n_writes
        retl
        prefetch        [%o0+STRIDE2], #n_writes
        SET_SIZE(prefetch_page_w)

!...

#endif

Pretty simple, huh ? The code is really the same for either UltraSparc III/III+ or UltraSparc II/II+ processor based platforms; the difference between these are HW specific. In UltraSparc III/III+ processors, the prefetch instruction queues are 8 entries deep. This means that at any one point, there can be 8 prefetch instructions in flight. Also, each prefetch instruction can prefetch only 64 bytes of data at a time (see pp. 207 of The Sparc v9 Architecture Manual). Since the sizeof (page_t) is 128 bytes in Solaris 10, it takes two prefetch instructions to bring the entire page_t in to the external cache (E$). Thus, we can really only have 4 prefetches for page_t's outstanding. Hence, we issue prefetches for the 4th page_t ahead of where we currently are. We should expect the first four page_t accesses to stall the pipeline while we warm-up the cache; however, once that's done, we should have a four entry buffer with prefetched page_t's.

Now that we understand these nuances, we go back to the code. Recall that add_physmem() is really a giant for() loop iterating on each page_t pointer (pp). Thus, we calculate the address of the page to prefetch by setting STRIDE1 to 4 * sizeof (page_t) or 512 bytes ahead of where the current pp is. But that only gets us half of a page_t, so we issue a second prefetch for the second half of the page_t, by setting STRIDE2 to STRIDE1 + 64 or 576. Four iterations from now, add_physmem() will access the page_t that was prefetched by this iteration.

In UltraSparc II/II+ processors, the prefetch instruction queues are only 3 entries deep. Since we need two prefetch instructions to bring in an entire page_t, we can only hope to prefetch one page_t ahead. Thus, we calculate the address of the page_t to prefetch by setting STRIDE1 and STRIDE2 to be sizeof (page_t) and sizeof (page_t) + 64, respectively, ahead of where the current pp is.

Aww Nuts; he's back to statistics...

Well, you'd hope that we didn't do all of the above just for naught, so here are the results. Again, a StarFire 15k w/900MHz cpu's was used to gather the data; the only difference was that it was configured with 245GB of physical memory. We performed the same measurements as before, gathering the averages for how long it would take the prefetching kernel to initialize all the pages for the system and calculated the page_t initialization rate.

  Memory = 29323064 8k pages (freemem)

Avg Time = 11.368 Secs

     R        11.368 S
    ---  =  ------------
    1pg     29323064 pgs

      R  = 387.68 nS/pg

If we extrapolate this data into our original 100GB system, we should be able to readily calculate the improvement against our original page_t initialization time.

   Memory = 100 * 1024 * 1024 * 1024 / 8k

          = 13107200 8k pages

       R  = 387.68 nS/pg

page_t Init Time = 387.68 nS/pg * 13107200 pages

                 = 5.081 Secs

We now only take 5.081 seconds to iterate thru the same number of pages. When we compare this against the original 23.4 seconds it used to take, we have:

   Memory = 100 * 1024 * 1024 * 1024 / 8k

          = 13107200 8k pages

Vanilla Solaris page_t init time = 23.4 seconds

Prefetch kernel page_t init time = 5.081 seconds

	Initial Time     23.4 Sec
	------------- = ---------- = 4.605
        Prefetch Time    5.081 Sec

From the above data, we can see that the new kernel can initialize page_t 4.61 times faster than the original vanilla solaris kernel. That's a realized improvement of about 460% by exploiting HW prefetch functionality.

And, there you have it... prefetching for fun and profit; or is it performance =:^) !?

Acknowledgments

Joe Bonasera, Todd Clayton, Michael Corcoran, David Fan, Becky Gill, Eric Lowe, Nils Nieuwejaar, Scott Porter, Mallik Ragampudi, Sarat Ravipati, Andrew Rutz, Mike Shapiro, and Bart Smaalders.


Technorati Tag:
Technorati Tag:



( Jun 14 2005, 11:43:26 AM CDT ) Permalink

20050512 Thursday May 12, 2005

A Blog is Born...

Hi, one and all... My name is Rick Mesta; have been @ Sun for close to 5 yrs now and have been working on NFSv4 for the majority of that time. I've also done a wee bit of kernel vm work, but most of my time these days are spent bringing you such coolness as nfsmapid(1m). More recently, I had the opportunity to do a small presentation @ Connectathon 2005 (http://www.connectathon.org) relating to an already submitted DNS I-D (http://www.ietf.org/internet-drafts/draft-mesta-nfs4id-dns-rr-01.txt). Well... that's all for now folks; but stay tuned... there might be something useful eventually =8^) ( May 12 2005, 04:51:23 PM CDT ) Permalink