Design and Implementation of A Compiler Framework for Helper Threading on Multi-Core Processors

Yonghong Song  Spiros Kalogeropulos  Partha Tirumalai
Scalable Systems Group
Sun Microsystems, Inc.
{yonghong.song, spiros.kalogeropulos, partha.tirumalai}@sun.com

Abstract

Helper threading is a technique that utilizes a second core or logical processor in a multi-threaded system to improve the performance of the main thread. A helper thread executes in parallel with the main thread that it attempts to accelerate. In this paper, the helper thread merely prefetches data into a shared cache and does not incur any other programmer visible effects. Helper thread prefetching has been proposed as a viable solution in various scenarios where it is difficult to prefetch efficiently within the main thread itself. This paper presents our helper threading experience on SUN’s second dual-core SPARC microprocessor, the UltraSPARC IV+. The two cores on this processor share an on-chip L2 and an off-chip L3 cache. We present a compiler framework to automatically construct helper threads and evaluate our scheme on the UltraSPARC IV+ processor. Our preliminary results using helper threads on the SPEC CPU2000 suite show gains of up to 22% on programs that suffer substantial L2 cache misses while at the same time incurring negligible losses on programs that do not suffer L2 cache misses.

1. Introduction

With the widening gap between processor and memory speeds, prefetching has been increasingly important to improve application performance [16, 18, 25, 26]. Currently, prefetching is most effective for memory access streams where future memory addresses can be easily predicted using loop index values [7, 18, 19]. For such access streams, software prefetch instructions are inserted into the program to bring data into cache before the use. Such a prefetching scheme in which the prefetches are interleaved with the main computation is also called interleaved prefetching.

Although it is quite successful for many cases [7, 19], interleaved prefetching tends to be less effective for two kinds of codes. First, for codes with complex array subscripts, memory access strides are often loop variant, even predictable, at compile time. Prefetching in such codes tends to incur excessive overhead as significant computation is required to compute future addresses. The complexity and overhead increase if the subscript evaluation involves loads that themselves must be prefetched and made speculative. One such example is an indexed array access [7]. If the prefetched data is already in the cache, such large overheads can cause a significant slowdown. To avoid risking large penalties, modern production compilers often ignore such cases by default, or prefetch data speculatively, one or two cache lines ahead.

The second class of difficult codes involve pointer-chasing. In these codes, at least one memory access is needed to get the memory address in the next loop iteration. Interleaved prefetching is not able to handle such cases. Several techniques have been proposed to attack pointer-chasing. Luk and Mowry proposes several compiler algorithms for recursive data structures [14, 15]. Their approaches, however, do not solve prefetching for general pointer-chasing codes. The jump-pointer approach [21] requires whole program analysis which may not be possible at compile time. Cahoon and McKinley try to detect the regularity of the memory stream at compile time for Java applications [3]. Adl-Tabatabai et al. develop a runtime scheme in a JIT Java compiler, using hardware counters to identify delinquent loads and memory address profiles to identify regularity of memory access streams [1]. Such information is later used by the runtime system for prefetch instruction insertion. Wu tries to detect the regularity of the memory stream with value profiling [28]. The success of his algorithm depends on how closely the training and actual inputs match each other as well as on how many predictable memory streams exist in the program.

Chip multi-threading (CMT) architectures with shared caches present new opportunities for prefetching. With CMT, another core or logical processor may be used to prefetch data needed by the main thread. Helper thread-
ing is a technique which can perform such prefetching in software. A helper thread, which is created at runtime, executes in parallel with the main thread, and does not have any programmer visible side effects. In our context, the helper thread attempts to prefetch data accessed by the main thread into the shared cache. Since it does not do any computation or stores other than the computation necessary to form prefetchable addresses and maintain approximate (often exact) control flow, the helper thread will typically execute faster than the main thread and act as an effective prefetcher to the main thread.

Prefetching with helper threading naturally handles the cases where interleaved prefetching is ineffective. In codes involving complex array subscripts, prefetching overhead is offloaded to the helper thread. For pointer-chasing codes, helper threading tries to speculatively load or prefetch what could be actually cache missing. Helper threading, however, is not free. Launching the helper thread and synchronization between the main thread and the helper thread incur overhead. Such overhead must be minimized by the compiler as well as the runtime system.

In this paper, we make the following contributions:

- Detailed algorithms for helper threading region selection and code generation are presented. Our scheme is also implemented in a production compiler code base.
- We evaluate our scheme on real hardware, using an UltraSPARC(TM) IV+ processor based system and the SPEC CPU2000 benchmark suite [22]. The UltraSPARC IV+ processor has two on-chip cores and a shared on-chip L2 cache. We compare helper threading performance to the best up-to-date serial performance, and show that our helper threading improves performance by up to 22% for programs that suffer L2 cache misses while at the same time causing minimal degradation on most programs that don’t suffer significant L2 cache misses.

The rest of paper is organized as follows. In Section 2, we describe the architecture of the UltraSPARC IV+ processor. In Section 3, we describe compiler support for helper threading. In Section 4, we discuss the runtime support needed for helper threading. We evaluate our implementation in Section 5, compare to previous work in Section 6, and draw a conclusion in Section 7.

2. Architecture Description

Our helper threading implementation targets the UltraSPARC IV+ microprocessor [6]. This chip has two 4-issue in-order superscalar cores each of which implements the functionality in a significantly enhanced UltraSPARC III design [11]. Each core has its own first level instruction and data caches, both 64KB. Each core also has its own instruction and data TLB’s. The cores share an on-chip 2MB level 2 unified cache which has low latency and adequate bandwidth to support smooth dual core operation. Also shared is a large 32MB off-chip dirty victim level 3 cache. The level 2 and level 3 caches can be configured to be in split or shared mode. In split mode, each core can allocate only in half the cache. However, each core can read all of the cache. In shared mode, each core can allocate in all of the cache. Unless otherwise mentioned, the experimental data presented in this paper are all with the caches in shared mode. Figure 1 shows a simple block diagram of the UltraSPARC IV+ processor.

The UltraSPARC IV+ processor implements the 64-bit SPARC V9 ISA [27] with extensions. With respect to our helper threading study, the implementation of the software prefetch extensions is important. Nine flavors of software prefetch are supported. These include the four flavors in SPARC V9: read once, read many, write once, and write many. These four variants can be either weak or strong. Weak prefetches are dropped if a TLB miss occurs during prefetch address translation. On the other hand, strong prefetches will generate a TLB trap and the prefetch will be processed (after the trap). An instruction prefetch is also provided for prefetching instructions. A control bit in the processor further controls the behavior of weak prefetches. When the 8-entry prefetch queue is full, either they can be dropped or they can stall the processor until a queue slot is available.

In our study, we allow the main thread to use all prefetch variants. Program analysis and compiler options determine the variants used for prefetchable accesses. Unless otherwise mentioned, the helper thread uses only strong prefetch variants. If prefetches were dropped on a TLB miss occurs during prefetch address translation. On the other hand, strong prefetches will generate a TLB trap and the prefetch will be processed (after the trap). An instruction prefetch is also provided for prefetching instructions. A control bit in the processor further controls the behavior of weak prefetches. When the 8-entry prefetch queue is full, either they can be dropped or they can stall the processor until a queue slot is available.

In our study, we allow the main thread to use all prefetch variants. Program analysis and compiler options determine the variants used for prefetchable accesses. Unless otherwise mentioned, the helper thread uses only strong prefetch variants. If prefetches were dropped on a TLB miss, the benefit of helper threading would be lost or vastly diminished. Our systems also had the prefetch control setting to disallow dropping of weak prefetches if the prefetch queue

![Figure 1. Block diagram of UltraSPARC IV+ processor.](image-url)
3. Compiler Support for Helper Threading

3.1. Overview

Our helper threading work focuses on loops. The compiler tries to analyze the program and identify the loop regions which are candidates for helper threading using the following criteria:

- **The loop contains memory accesses which potentially could incur cache misses.**

- **The prefetches generated by the helper thread will trigger cache misses sufficiently early before the prefetched data are used by the main thread.**

- **It is profitable to use a helper thread to generate prefetches for the loop.** The benefit from such prefetching outweighs the cost of using a helper thread.

Figure 2 shows the overall algorithm. In Figure 2(a), a loop hierarchy tree is first built for the whole program, followed by reuse analysis and prefetch candidate identification [7] to introduce only necessary prefetches and avoid issuing redundant ones. The function `prefetching_using_helper_thread_driver` is then called recursively to identify candidates and generate codes for helper threading. In Figure 2(b), if a loop in the loop hierarchy is identified as a helper threading candidate and it is profitable to use a helper thread, the loop will be transformed for helper threading purpose. Otherwise, its immediate inner loops will be examined further.

Due to the dynamic nature of operating system scheduling, the following two issues need to be addressed in code generation:

- Ensure the helper thread will do useful work.
- Avoid slowdown of the main thread.

The first issue is addressed by checking whether the main thread has completed the execution of the loop before the helper thread starts the execution of the corresponding loop, and by the helper thread inquiring periodically whether the main thread has completed the execution of the loop.

The second issue is addressed by avoiding synchronization with the helper thread at the end of the main thread for each particular helper threading loop, and by inserting prefetch instructions in the main thread as in the interleaved prefetching mode.

In the rest of this section, first, we present how loops are selected to be helper threading candidates. Then, we present our approach that determines if it is profitable to use a helper thread for a loop. Finally, the code generation scheme for helper threading is presented.

3.2. Selecting Candidate Loops

The benefits of using a helper thread for prefetching to speed up the main thread come from the following:

- The helper thread has potentially less computation to execute than the main thread. It could, therefore, execute certain loads earlier and bring their values to the shared L2 cache.

- Certain loads, if their loaded values are not used to compute a branch condition or an address used by another load/store, can be transformed into prefetches in the helper thread. Furthermore, stores can also be transformed into prefetches. These prefetches can bring data to the shared L2 cache, representing a potentially significant execution time saving for the main thread. The above load or store is called an effective prefetch candidate, if its address computation depends on at least another load in the same loop body or the load/store is identified as a prefetch candidate by using reuse analysis [7].

If the application is memory-bound, the first potential benefit will be less because the loads in both the main thread and the helper thread could be in the critical path for the application. Our scheme selects candidate loops mainly based on the second potential benefit. In the final code of the helper thread, all effective prefetch candidates will be replaced by strong prefetches to their corresponding addresses, in order to realize the potential benefit for the main thread.

Our compiler encodes alias information derived from pointer and array memory accesses in the data flow graph. The data flow generated by such alias information may be
Figure 3. The algorithm to select helper threading candidate loops.

Figure 4. The algorithm to determine the profitability of using a helper thread for candidate loops.

The implementation of helper threading utilizes the existing automatic parallelization infrastructure which uses a fork-join model [23]. The parallelizable loop will be outlined and a runtime library is called to control dispatching the threads, synchronization, etc. Parallelization involves overhead in the runtime library and also parameter passing overhead due to outlining. The benefit of using a helper thread comes from the potential cache hit in the main thread for some memory accesses which could be cache misses in a single-threaded run. The compiler analyzes the potential benefit of using a helper thread versus parallelization overhead to decide the profitability of using a helper thread for a loop.

Figure 4 shows the algorithm to determine helper threading profitability for a candidate loop. The overhead of parallelization is computed as the runtime library cost, \( startup\_cost \), and the cost of passing various shared and first/last private variables [20], \( parameter\_passing\_cost \). The \( startup\_cost \) is a fixed empirical value and the \( parameter\_passing\_cost \) is the cost of passing the value for one variable, which is also a fixed empirical value, multiplied by the number of variables.

The computation of the helper threading benefit is focused on effective prefetch candidates. For each effective prefetch candidate, the potential saving, \( p\ benefit \), is computed as the total number of memory accesses in one invocation of this loop, \( num\_of\ accesses \), multiplied by the L2 cache miss penalty, \( L2\ miss\_penalty \), multiplied by the potential L2 cache miss rate for this memory access, \( potential\_L2\_miss\_rate \). The \( L2\ miss\_penalty \) is a fixed value given for a specific architecture. In the absence of cache profiling, our approach to determine the \( potential\_L2\_miss\_rate \) value for an effective prefetch candidate is based on the complexity of its address computation and whether a prefetch is available in the main thread. The current values of \( potential\_L2\_miss\_rate \) are determined experimentally for different address computation complexity levels.

---

```
procedure is_helper_thread_candidate (Loop *loop)
if (there exists any calls with side effects in the loop body)
    then
        return FALSE
end if
if (the loop is computation bound) then
    return FALSE
end if
if (there exists no effective prefetch candidate) then
    return FALSE
end if
for (all effective prefetch candidates and conditional branches) do
    if (floating-point computation is required directly or indirectly) then
        return FALSE
    end if
end for
return TRUE
end procedure

procedure is profitable to use_helper_thread (Loop *loop)
    p\_benefit = Startup\_cost + Parameter\_passing\_cost
    p\_benefit = 0
    for (each effective prefetch candidate in the loop body) do
        p\_benefit = p\_benefit + num\_of\ accesses * L2\ miss\_penalty
    end for
    if (both p\_benefit and p\_overhead are known at compile time) then
        if (p\_benefit < p\_overhead) then
            return FALSE
        else
            return TRUE
        end if
    end if
end procedure
```
The computation of the number of accesses for an effective prefetch candidate (\textit{num of accesses}) depends on the availability of the profile feedback information. If the profile feedback information is available, the \textit{num of accesses} is computed as the total number of memory accesses for an effective prefetch candidate divided by the times the loop is accessed, as the overhead is computed for each invocation (not each iteration) of the loop. If the profile data shows that the loop is not accessed at all, the value for \textit{num of accesses} is set to 0.

If the profile feedback information is not available, the value of \textit{num of accesses} is computed based on the compile time information of loop trip counts and branch probability. If the actual trip count is not known at compile time, our approach is to examine whether the trip count can be computed symbolically through some loop invariants. Otherwise, a trip count of 25 will be assumed [10]. For conditional statements, equal probability for if taken/non-taken targets or all case targets of a switch statement is assumed. The total number of accesses, \textit{num of accesses}, will be computed based on trip counts and assigned branch probability information.

The total benefit of using a helper thread for a loop, \( p_{benefit}\), is the summation of the benefits of all effective prefetch candidates. If \( p_{benefit}\) is greater than \( p_{overhead}\) using compile time information, this loop will be a candidate for helper threading. Otherwise, if \( p_{benefit}\) is no greater than \( p_{overhead}\), this loop will not be a candidate. Furthermore, if the compile time information produces inconclusive profitability result with symbolic trip count computation, a two-versioned loop with a runtime condition for profitability \( p_{benefit} > p_{overhead} \) will be generated. At runtime, if the condition is true, the helper threading version will be executed. Otherwise, the original serial version will be executed.

### 3.4. Code Generation

Code generation for a candidate loop to use helper threading involves three phases. In the first phase, code like Figure 5(a) will be generated. The runtime library has been modified to guarantee that if the loop \( t \) is parallelized and two threads are available, the main thread will execute the branch if \( “t == 0” \) is true, and the helper thread will execute the other branch. The purpose is to minimize the overhead for the main thread to avoid the main thread slowdown. The helper thread may incur potential overhead to warm up its L1 cache and the TLB. The else branch loop in Figure 5(a) will be transformed to form a helper thread loop.

In the second phase, a proper helper thread loop will be generated through program slicing and variable renaming. The helper thread loop is a sliced original loop containing only the original control flow and necessary statements to compute conditionals and the effective prefetch candidate addresses. All effective prefetch candidates are replaced by \textit{strong} prefetches to their corresponding addresses. In the helper thread, all loads will become \textit{non-faulting} loads to avoid exceptions, and all stores will be either removed or turned to strong prefetches.

All \textit{upward-exposed} or \textit{downward-exposed} assigned variables in the helper thread loop will be renamed and copy statements of original variables to their corresponding temporary variables are placed right before the helper thread loop. In our scheme, all scalar variables are \textit{scoped as private} variables including first private, or both first and last private (see Section 3.5), so that these temporary variables will get correct values at runtime. Figure 5(b) shows the code after program slicing and variable renaming.

In practice, it is possible that the helper thread could run behind the main thread. If this happens, the helper thread should finish early to avoid doing useless work. In the last phase, the following code is inserted to ensure that the helper thread is terminated when it runs behind the main thread.

- Code to indicate that the main thread loop has completed execution immediately after the main thread loop.
- Code to check whether the main thread loop has completed execution before executing the helper thread loop.
- Code to check whether the main thread has completed execution every certain number of loop iterations in the helper thread loop and all its inner loops. This can be done by adding checking at every loop back edge,
procedure helper_thread_code_gen (Loop *loop);
*/ step 1: generated unsliced helper thread loop */
make a copy of the original loop and generate code like Figure 5(a).
*/ step 2: program slicing and variable renaming for
the helper thread loop. */
for (each effective prefetch candidate in
the helper thread loop) do
mark all statements for its address computation, directly
or indirectly, as undeletable.
mark this load or store to a strong prefetch, and mark it
as undeletable.
end for
for (every branch in the loop body) do
mark all statements for branch condition computation,
directly or indirectly, as undeletable.
mark this branch as undeletable.
end for
delete all the unmarked deletable statements.
for (every upward-exposed or downward-exposed variable v
in at least one assignment in the helper thread loop) do
create a temporary variable tv and an assignment tv = v
right before the helper thread loop.
rename all appearances of v with tv in the
helper thread loop body.
end for
*/ The code like Figure 5(b) is generated. */
/* step 3: insert checking code to prevent the helper thread
from running behind the main thread. */
add an assignment right after the main thread loop to
indicate it has completed execution.
add a check whether the main thread loop has completed execution
or not before executing the helper thread loop.
add code at every back edge of the helper thread loop and
its inner loops to check whether the main thread has
completed execution or not.
make the loop t in Figure 5(c) as DOALL loop and perform
variable scoping as in Section 3.5.
end procedure

Figure 6. The algorithm to transform a software helper threading loop candidate to a
DOALL loop.

which will be illustrated later in detail through an example.

If any checking reveals that the loop in the main thread has
completed execution, the helper thread will stop its work
immediately. Figure 5(c) shows the transformed code. The
loop t in Figure 5(c) is marked as a DOALL loop and will be
later parallelized with the existing automatic parallelization
framework.

3.5. Variable Scoping

For the parallel loop t in Figure 5(c), the compiler scopes
the variables based on the following rules:

- All arrays and address-taken scalars are shared.
- All non-address-taken scalars (including structure members) are private.
- All private scalars upward-exposed to the beginning of
loop t are first private.

- All private scalars downward-exposed to the end of
loop t are both last private and first private. The purpose
is to copy out correct value in case that the scalar
assignment statement does not execute at runtime.

For any downward exposed variables, the runtime library
and outlining code generation have been modified to copy
out the downward exposed variables in the main thread
since all the original computation is done in the main thread.
Figure 6 shows the compiler algorithm to transform a helper
threading loop candidate to a DOALL loop.

3.6. Examples

Figure 7(a) shows an example, whose trip counts cannot
be computed at compile time. We also assume that
the compiler is not able to guarantee that q → data and
p → next access different memory locations at compile
time. If profile feedback data is available, the compiler will
compute the trip count and branch probabilities based on
profile data. Otherwise, the compiler chooses default val-
cues for unknown trip counts and branch probabilities as in
the Section 3.3. Figure 7(b) shows the two-version parallel-
ization transformation. The b1 is the potential benefit for
helper threading and o1 is the parallelization overhead. Both
are compile-time constants. Therefore, at compile time, the
branch will be resolved. Figure 7(c) shows program slic-
ing and variable renaming. Note that the variable tmp.p is
used to copy the original p value. Figure 7(d) shows the
added checking codes to end the helper thread earlier, if
the helper thread runs behind the main thread. The variable
tmp.¢ is used to count the number of iterations in the helper
thread. The variable check.¢, which is a compile-time con-
stant, specifies the number of iterations to check whether
the main thread has finished or not. Note that all back edges in
the helper thread loop or its inner loops are checked. This is
necessary in case that the innermost loop is never or rarely
got executed.

4. Runtime Support for Helper Threading

In Figure 5(c), the compiler creates a parallel loop t
which will spawn the main thread and the helper thread
at runtime. Helper threading shares the same runtime as
automatic/explicit parallelization. For each helper threading
loop, runtime creates one POSIX thread to represent
the helper thread. This POSIX thread will be reused as
the helper thread for subsequent helper threading loops.
Since synchronization may unnecessarily slow down the
main thread, if a helper thread runs behind, we do not want
them to be synchronized at the end of parallel for loop t (in
Figure 5(c)).

Currently, some data (like loop bounds, first private data
and shared data, etc.) are passed from the serial portion
while (p) {
    if (p->data == 0) {
        q = p->data;
        while (q) {
            q->data = c;
            q = q->next;
        }
        p = p->next;
    } else {
        while (p) {
            q = p->data;
            while (q) {
                q->data = c;
                q = q->next;
            }
            p = p->next;
        }
    }
}

Figure 7. A code generation example.

Figure 8. Action taken by the main thread and the helper thread to free shared parallel data.

of the main thread to the runtime library, and then to the outlined routine, which will be executed by both the main thread and the helper thread. Such data, which we call shared parallel data, will be allocated on the heap through malloc routine. The runtime system must find a way to free such space to avoid potential out-of-memory issues.

The main thread will access every piece of shared parallel data. However, the helper thread may not, since either it may be suspended, or it runs far behind so skipping some helper threading loops. Also, for every piece of shared data, the main thread will access it first before the helper thread accesses it, since the main thread activates the helper thread.

Figures 8(a) and (b) show the action taken by the main thread and the helper thread, respectively, to free shared parallel data. The function parameter is the address of the shared parallel data for the current helper threading loop. The functions are called at the beginning of the main thread and the helper thread inside the runtime library, respectively, before delivering control to the outlined routine. The global variables prev_main_data and prev_helper_data are used to record the previously accessed shared parallel data by the main thread and the helper thread, respectively, both of which have an initial NULL value. If the future accessed shared parallel data by the helper thread are not the one currently accessed by the main thread, the helper thread should not continue the stale helper threading loop, which is indicated by the return value should_continue. Since both functions access the shared data, to avoid race condition, the same LOCK/UNLOCK pair is placed in the beginning and the end of both functions.

5. Experimental Results

In this section, we present experimental results with helper threading on the SPEC CPU2000 suite [22]. Our experiments were done on the UltraSPARC IV+ processor as
described in Section 2. All the techniques described in Sections 3 and 4 have been implemented in a prototype based on SUN’s production compiler. We compare our helper threading performance with the best serial performance obtained using the peak compiler flags in SUN’s SPEC submission [22]. Currently, our runtime system does not automatically detect and bind the main thread and the helper thread to the same chip. We do this manually using the environment variable SUNW_MP_PROCBIND [24].

Table 1 shows the distribution of loops in each program including the number of loops accepted and rejected for helper threading. The “total” is the number of loops examined by the procedure in Figure 3. Note that if an outer loop is considered a candidate and profitable, its inner loops are not counted in the “total”. The “bad shape” column shows the number of loops rejected because of multiple loop exits or unknown control flow. Note that to outline a loop, we need a single-entry single-exit region. The “calls” column shows the number of loops rejected because they contain calls with side effects. The “bad comp” column shows the number of loops rejected because the execution of address computation or branch resolution codes may raise an exception. The “no save” column shows the number of loops rejected because the compiler was not able to find any effective prefetch candidate. The “other opt” column shows the number of loops rejected because the compiler determines that outlining it would hurt some other optimization 1. The “overrun” column shows the number of loops rejected because the compiler determines that the helper thread may run too far ahead of the main thread and overwrite the shared cache. The “no benefit” column shows the number of loops rejected because the compiler was not able to find any profitable prefetch candidate. The “2 ver” column shows the number of loops rejected because the compiler cannot be outlined because of constraint conditions as described in Section 3.3. The “1 ver” column shows the number of loops for which two versions with a runtime check are generated as described in Section 3.3. The “%load stalls” column shows the percentage of execution time spent stalled for L2 load cache misses (when helper threading is not used). This is measured using processor performance counters. Note that this counter does not include store stall time which could also be reduced by helper thread prefetching. Note that in Table 1, the only two-versioned loops are in swim. This is because the peak SPEC submission, to which we compare our work, used profile feedback for all benchmarks except swim.

Figure 9 shows the speedup achieved with helper threading (HT). The “GM” shows the geometric mean speedup of all CPU2000fp or CPU2000int benchmarks, respectively. Three benchmarks, equake, lucas, and mcf, achieve a speedup of over 1.1, with a maximum speedup of 1.22 for mcf. fma3d suffers L2 cache misses significantly (Table 1), while our helper threading does not improve performance much. We have work to do in the future in both interleaved prefetching and helper threading for this benchmark. Art and parser run slower with helper threading than without it. This is because memory accesses in the regions selected for helper threading for these two benchmarks are mostly cache resident and the overhead of helper threading becomes significantly greater than the corresponding gain. We have focused on helper threading without cache miss profiling information, in order to increase its acceptance and ease of use. However, we do plan to study and utilize cache miss profiling to avoid such regressions.

To further understand the performance difference, we measured the number of L2 cache misses using performance counters. Figure 10 shows the reduction in L2 cache misses due to helper threading for the main thread. For all benchmarks, helper threading is able to either maintain or reduce the number of L2 cache misses. For equake, L2 cache misses are reduced by over 94%. For crafty and twolf, there is a large reduction in L2 cache misses but it does not translate into performance gain because these programs have a very low L2 cache miss rate to begin with (see Table 1). For gzip, although L2 cache misses are reduced by 25%, performance with helper threading rather degrades by 2%. This is due to compiler phase ordering problems and is an area we are trying to improve currently.
Due to space limitations, we briefly summarize the results of some other interesting experiments we have done.

- It is always profitable to not synchronize at the end of the parallel for in Figure 5(c). The performance difference with and without such synchronization is within 3% for all benchmarks except mcf, where the performance difference is 16%.

- Interleaved prefetching in the main thread remains important even with helper threading. This is because the helper thread only brings the data to the L2 cache and software prefetching in the main thread often does a better job of bringing data from the L2 cache to the prefetch cache than hardware prefetching [7]. For example, prefetching in the main thread can achieve a speedup of 1.8 compared to no prefetching in the main thread for swim.

- Though we have attempted helper threading without cache miss profiling, we use conventional block count profiling for all benchmarks except swim. Without this information, the performance of mgrid, apsi and gcc degraded by 23%, 22% and 8%, respectively. This was due to excessive two-version code generation. In the future, we plan to study how to improve our profitability test and judiciously generate two-version code, for cases where no profile feedback information is available.

- The gains described in Figure 9 are over very aggressive SPEC peak options. We have also obtained the gains due to helper threading with more typical compiler options and where interleaved prefetching is not used. When helper threading is added to this scenario, prefetching is done entirely by the helper thread. Figure 11 shows the results, which clearly demonstrate that the helper thread brings data successfully into the shared L2 cache and significantly accelerating the main thread. The geometric mean improves by a factor of $1.30X$ for CPU2000fp where several benchmarks suffer large L2 cache miss penalties. Most CPU2000int benchmarks do not suffer such penalties and here our helper threading causes minimal degradation.

6. Related Work

Helper threading is not new. Kim et al. present their helper threading experience on a hyperthreaded Pentium(TM) processor [8]. They rely on cache miss profile data for helper threading while we use regular edge profiling or a two-version scheme if profiling data is not available. We have targeted a dual-core design where the helper thread does not contend for pipeline resources with the main thread. Our work does not use any special hardware support for helper threading other than a shared L2 cache. We have presented a detailed and systematic code generation scheme for helper threading. Kim et al. present a source-to-source transformation and simulation framework for helper threading [9, 10]. They use a preprocessor while we directly im-

Table 1. Distribution of loops for helper threading.

<table>
<thead>
<tr>
<th>benchmarks</th>
<th>total</th>
<th>bad_shape</th>
<th>calls</th>
<th>bad comp</th>
<th>no save</th>
<th>other opt</th>
<th>overrun</th>
<th>no benefit</th>
<th>2 ver</th>
<th>1 ver</th>
<th>%LoadStalls</th>
</tr>
</thead>
<tbody>
<tr>
<td>webc</td>
<td>197</td>
<td>0</td>
<td>55</td>
<td>4</td>
<td>4</td>
<td>0</td>
<td>0</td>
<td>132</td>
<td>0</td>
<td>2</td>
<td>0.1%</td>
</tr>
<tr>
<td>swim</td>
<td>77</td>
<td>0</td>
<td>19</td>
<td>0</td>
<td>3</td>
<td>0</td>
<td>0</td>
<td>55</td>
<td>55</td>
<td>0</td>
<td>0.8%</td>
</tr>
<tr>
<td>mgrid</td>
<td>54</td>
<td>1</td>
<td>19</td>
<td>0</td>
<td>2</td>
<td>0</td>
<td>0</td>
<td>28</td>
<td>0</td>
<td>4</td>
<td>6.8%</td>
</tr>
<tr>
<td>apsi</td>
<td>85</td>
<td>8</td>
<td>12</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>64</td>
<td>0</td>
<td>0</td>
<td>4.3%</td>
</tr>
<tr>
<td>mesa</td>
<td>132</td>
<td>57</td>
<td>25</td>
<td>8</td>
<td>6</td>
<td>0</td>
<td>0</td>
<td>135</td>
<td>0</td>
<td>2</td>
<td>2.5%</td>
</tr>
<tr>
<td>galgel</td>
<td>452</td>
<td>9</td>
<td>38</td>
<td>5</td>
<td>19</td>
<td>0</td>
<td>0</td>
<td>169</td>
<td>0</td>
<td>12</td>
<td>0.8%</td>
</tr>
<tr>
<td>art</td>
<td>85</td>
<td>3</td>
<td>10</td>
<td>13</td>
<td>6</td>
<td>0</td>
<td>0</td>
<td>52</td>
<td>0</td>
<td>1</td>
<td>0.3%</td>
</tr>
<tr>
<td>equake</td>
<td>85</td>
<td>0</td>
<td>14</td>
<td>3</td>
<td>39</td>
<td>0</td>
<td>0</td>
<td>27</td>
<td>0</td>
<td>2</td>
<td>9.5%</td>
</tr>
<tr>
<td>facerec</td>
<td>236</td>
<td>1</td>
<td>38</td>
<td>1</td>
<td>8</td>
<td>2</td>
<td>0</td>
<td>184</td>
<td>0</td>
<td>1</td>
<td>1.4%</td>
</tr>
<tr>
<td>ammp</td>
<td>423</td>
<td>209</td>
<td>14</td>
<td>4</td>
<td>4</td>
<td>0</td>
<td>0</td>
<td>179</td>
<td>0</td>
<td>13</td>
<td>9.8%</td>
</tr>
<tr>
<td>lucas</td>
<td>83</td>
<td>9</td>
<td>10</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>52</td>
<td>0</td>
<td>11</td>
<td>4.3%</td>
</tr>
<tr>
<td>fma2</td>
<td>5530</td>
<td>84</td>
<td>62</td>
<td>5</td>
<td>242</td>
<td>0</td>
<td>0</td>
<td>5128</td>
<td>0</td>
<td>9</td>
<td>17.8%</td>
</tr>
<tr>
<td>sixtrack</td>
<td>2358</td>
<td>127</td>
<td>130</td>
<td>15</td>
<td>125</td>
<td>0</td>
<td>0</td>
<td>1954</td>
<td>0</td>
<td>7</td>
<td>0.5%</td>
</tr>
<tr>
<td>apsi</td>
<td>139</td>
<td>10</td>
<td>57</td>
<td>7</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>109</td>
<td>0</td>
<td>5</td>
<td>3.5%</td>
</tr>
<tr>
<td>gzip</td>
<td>301</td>
<td>50</td>
<td>38</td>
<td>0</td>
<td>41</td>
<td>0</td>
<td>0</td>
<td>165</td>
<td>0</td>
<td>7</td>
<td>0.2%</td>
</tr>
<tr>
<td>vpr</td>
<td>623</td>
<td>175</td>
<td>157</td>
<td>2</td>
<td>24</td>
<td>0</td>
<td>0</td>
<td>250</td>
<td>0</td>
<td>13</td>
<td>5.1%</td>
</tr>
<tr>
<td>gcc</td>
<td>3449</td>
<td>1340</td>
<td>563</td>
<td>0</td>
<td>178</td>
<td>0</td>
<td>0</td>
<td>1352</td>
<td>0</td>
<td>16</td>
<td>4.0%</td>
</tr>
<tr>
<td>mcf</td>
<td>66</td>
<td>17</td>
<td>3</td>
<td>0</td>
<td>3</td>
<td>1</td>
<td>0</td>
<td>34</td>
<td>0</td>
<td>8</td>
<td>30.2%</td>
</tr>
<tr>
<td>crafty</td>
<td>416</td>
<td>76</td>
<td>164</td>
<td>0</td>
<td>18</td>
<td>0</td>
<td>0</td>
<td>155</td>
<td>0</td>
<td>2</td>
<td>0.08%</td>
</tr>
<tr>
<td>parser</td>
<td>791</td>
<td>310</td>
<td>193</td>
<td>0</td>
<td>85</td>
<td>0</td>
<td>0</td>
<td>195</td>
<td>0</td>
<td>8</td>
<td>3.1%</td>
</tr>
<tr>
<td>cct</td>
<td>572</td>
<td>73</td>
<td>33</td>
<td>7</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>459</td>
<td>0</td>
<td>0</td>
<td>0.0%</td>
</tr>
<tr>
<td>perlmem</td>
<td>195</td>
<td>426</td>
<td>128</td>
<td>0</td>
<td>22</td>
<td>0</td>
<td>0</td>
<td>619</td>
<td>0</td>
<td>0</td>
<td>2.1%</td>
</tr>
<tr>
<td>gap</td>
<td>2454</td>
<td>643</td>
<td>146</td>
<td>0</td>
<td>34</td>
<td>5</td>
<td>0</td>
<td>1693</td>
<td>0</td>
<td>6</td>
<td>8.5%</td>
</tr>
<tr>
<td>vortex</td>
<td>230</td>
<td>120</td>
<td>15</td>
<td>0</td>
<td>6</td>
<td>1</td>
<td>0</td>
<td>79</td>
<td>0</td>
<td>0</td>
<td>5.6%</td>
</tr>
<tr>
<td>harris</td>
<td>192</td>
<td>35</td>
<td>62</td>
<td>0</td>
<td>32</td>
<td>0</td>
<td>0</td>
<td>53</td>
<td>0</td>
<td>10</td>
<td>6.9%</td>
</tr>
<tr>
<td>twolf</td>
<td>1047</td>
<td>164</td>
<td>175</td>
<td>6</td>
<td>183</td>
<td>1</td>
<td>0</td>
<td>481</td>
<td>0</td>
<td>35</td>
<td>0.07%</td>
</tr>
</tbody>
</table>
plement helper threading inside the compiler. They also require special hardware support.

A number of researchers have tried to construct helper threads by post-processing binaries. Liao et al. use a binary rewriter to generate helper threads [12]. Collins et al. use speculative pre-computation for prefetching [5]. Luk describes the use of helper threading in the simultaneously multithreaded machines [13]. Both [5] and [13] require special hardware support. Brown et al. evaluate helper threading on chip multiprocessors through simulation and also propose several architectural enhancements for better helper threading on chip multiprocessors [2], while we conduct our experiments on real hardware. There have also been attempts to construct helper threads or dynamically perform prefetching at runtime. Moshovos et al. tried this approach by using a special hardware slicing processor [17].

Prefetching for linked-list data structures and general prefetching in a single thread are also not new. Roth and Sohi use jump pointers to compute prefetching targets [21]. Choi et al. show a prefetch engine to perform multi-chain prefetching, a technique to exploit inter-chain memory parallelism [4]. Wu uses value profiling to predict the regularity in irregular streams [28]. Mowry et al. show how prefetching can help performance for regular predictable memory streams in both a uniprocessor and a multiprocessor system [18, 19]. Tullsen and Eggers describe techniques for prefetching effectively in a multiprocessor system [25].

7. Conclusion

In this paper, we have presented a compiler framework to perform helper threading. We have shown techniques for selecting candidate loops, deciding whether candidate loops are profitable, and generating code that is resilient to operating system scheduling. Our method operates without cache miss profile data and without special hardware support. We have implemented our framework in a production compiler and evaluated it on a dual-core processor. In our experiments using the SPEC CPU2000 benchmark suite, helper threading improved performance for codes suffering large L2 cache miss penalties without substantially degrading the performance of the others. The maximum gain observed was 22%.

The following aspects of helper threading merit further study. Easy-to-use cache miss profile information could lead to increased performance gains and reduced regressions. Special hardware support for helper threading would reduce overheads. Given the emergence of processors with a large number of threads, the possibility of having more than one helper thread for a main thread can be considered. New heuristics might help improve prefetching in the helper thread. For example, the indexed array access prefetching...
technique [7] can be applied in the helper thread by default because helper threads can afford to do more work in support of prefetching. Finally, improvements in the runtime library, e.g. automatic processor binding, can make helper threading more effective.

References


