COMPACTING LOADS AND STORES FOR CODE SIZE REDUCTION

A Thesis

presented to

the Faculty of California Polytechnic State University

San Luis Obispo

In Partial Fulfillment

of the Requirements for the Degree

Master of Science in Electrical Engineering

by

Isaac Asay

February 2014
COMMITTEE MEMBERSHIP

TITLE: Compacting Loads and Stores for Code Size Reduction

AUTHOR: Isaac Asay

DATE SUBMITTED: February 2014

COMMITTEE CHAIR: John Oliver, PhD
Associate Professor of Electrical Engineering

COMMITTEE MEMBER: Chris Lupo, PhD
Assistant Professor of Computer Science

COMMITTEE MEMBER: Lynne Slivovsky, PhD
Professor of Electrical Engineering
Abstract

Compacting Loads and Stores for Code Size Reduction

Isaac Asay

It is important for compilers to generate executable code that is as small as possible, particularly when generating code for embedded systems. One method of reducing code size is to use instruction set architectures (ISAs) that support combining multiple operations into single operations. The ARM ISA allows for combining multiple memory operations to contiguous memory addresses into a single operation. The LLVM compiler contains a specific memory optimization to perform this combining of memory operations, called ARMLoadStoreOpt. This optimization, however, relies on another optimization (ARMPreAllocLoadStoreOpt) to move eligible memory operations into proximity in order to perform properly. This mover optimization occurs before register allocation, while ARMLoadStoreOpt occurs after register allocation. This thesis implements a similar mover optimization (called MagnetPass) after register allocation is performed, and compares this implementation with the existing optimization. While in most cases the two optimizations provide comparable results, our implementation in its current state requires some improvements before it will be a viable alternative to the existing optimization. Specifically, the algorithm will need to be modified to reduce computational complexity, and our implementation will need to take care not to interfere with other LLVM optimizations.

Keywords: arm, compiler, loads, memory, optimization, stores
Acknowledgments

I am indebted to Dr. Chris Lupo for his role as advisor over the years. His experience and feedback have been invaluable in the development of this thesis, and his good humor and positivity have pulled me out of many project quagmires. Thanks also to Dr. John Oliver and Dr. Lynne Slivovsky for their roles on my thesis committee, and for the great experiences I had learning from them in the excellent classes they teach. Good teachers shaped my education more than anything, and these three in particular have been responsible for instilling in me a love of computer architecture, design, and engineering.

Over the years, I have had many teachers at Cal Poly who have made learning fun and have opened up entirely new avenues of thought in my life. Besides those already mentioned, I wish to particularly thank Dr. Phillip Nico, Dr. Chris Buckalew, and Dr. Franz Kurfess for introducing me to completely new areas of Computer Science, and keeping the spark of fascination alive as I studied with them. Though my focus now is on engineering, my first academic love has always been mathematics, and I want to particularly express my appreciation for Dr. Todd Grundmeier, who teaches mind-bending geometry with aplomb, and Dr. Dylan Retsek, who made my head hurt in a good way while teaching me how to prove things.

I cannot thank the teachers in my life without recognizing the overwhelming contributions made to my academic career by my parents. Both have encouraged me in my studies for as long as I can remember, and both have consistently modeled a love of learning and a commitment to lifelong learning. I wish to especially thank my mother, Cheryl, for her tenacity in teaching me and my siblings as she homeschooled us through high school. I would not be where I am today without her support, perseverance, and positive attitude toward education. She is a model to me of what it looks like to sacrifice for the good of your children.

Finally, this thesis would never have been finished without the patient encouragement of my wonderful wife Emily. When the research was long, she fed me. When there was no light at the end of the tunnel, she was my friend and companion. When I felt like quitting, she wouldn’t let me. Her gentle warrior spirit has kept me going these years to finish what I started. Thank you.
Contents

List of Tables xi  
List of Figures xii  
List of Listings xiii  

1 Introduction 1  
1.1 Embedded Computing Challenges 1  
1.2 ISAs 3  
1.3 Significance of Memory Operations 4  
1.4 Target Platform: ARM Cortex-A8 Microprocessor 5  
1.5 The Role of Compilers 6  
1.6 Ordering Optimizations 7  
1.7 Target Compiler: LLVM 8  
1.8 Register Allocation During Compilation 9  
1.9 Memory Optimization 9  
1.10 Memory Optimization in LLVM 11  
1.11 Previous Work 13  
1.12 Contribution of This Thesis 16  
1.13 Overview of Thesis 17  

2 Theory 18  
2.1 The Importance of Clustering 18
C.3 Python Prototype . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
List of Tables

5.1 Summary of instructions in dijkstra_large . . . . . . . . . . . . . . 76
5.2 Summary of instructions in qsort_large . . . . . . . . . . . . . . . . 79
5.3 Summary of instructions in sha . . . . . . . . . . . . . . . . . . . . 83
5.4 Summary of instructions in sha_driver . . . . . . . . . . . . . . . . 88
5.5 Summary of instructions in bmhasrch . . . . . . . . . . . . . . . . 89
5.6 Summary of instructions in bmhisrch . . . . . . . . . . . . . . . . . 91
5.7 Summary of instructions in bmhsrch . . . . . . . . . . . . . . . . . 92
5.8 Summary of instructions in pbmsrch_large . . . . . . . . . . . . . . 92
5.9 Compilation time (in seconds) comparison . . . . . . . . . . . . . . . 94

A.1 Compilation results of benchmarks . . . . . . . . . . . . . . . . . . 107

B.1 Categories of instructions in dijkstra_large by basic block . . . . 111
B.2 Categories of instructions in qsort_large by basic block . . . . . . 112
B.3 Categories of instructions in sha by basic block . . . . . . . . . . . 114
B.4 Categories of instructions in sha_driver by basic block . . . . . . 115
B.5 Categories of instructions in bmhasrch by basic block . . . . . . 117
B.6 Categories of instructions in bmhisrch by basic block . . . . . . . 119
B.7 Categories of instructions in bmhsrch by basic block . . . . . . . 121
B.8 Categories of instructions in pbmsrch_large by basic block . . . . 123
List of Figures

1.1 Optimization example ........................................ 12
2.1 Existing pass architecture ................................. 19
2.2 New pass architecture ...................................... 20
List of Listings

<table>
<thead>
<tr>
<th>Listing</th>
<th>Description</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>2.1</td>
<td>Basic Load Example</td>
<td>24</td>
</tr>
<tr>
<td>2.2</td>
<td>Basic Store Example</td>
<td>27</td>
</tr>
<tr>
<td>2.3</td>
<td>Example of clustering selection</td>
<td>28</td>
</tr>
<tr>
<td>2.4</td>
<td>Example after only focusing on memory instructions</td>
<td>29</td>
</tr>
<tr>
<td>2.5</td>
<td>Example after only focusing on loads</td>
<td>30</td>
</tr>
<tr>
<td>2.6</td>
<td>Example after only focusing on r2 as base register</td>
<td>31</td>
</tr>
<tr>
<td>3.1</td>
<td>The MemOpRecord Class</td>
<td>34</td>
</tr>
<tr>
<td>3.2</td>
<td>The RegisterDependencies Class</td>
<td>36</td>
</tr>
<tr>
<td>3.3</td>
<td>ASM Example 1</td>
<td>37</td>
</tr>
<tr>
<td>3.4</td>
<td>ASM Example 2</td>
<td>38</td>
</tr>
<tr>
<td>3.5</td>
<td>MagnetPass.runOptimization()</td>
<td>40</td>
</tr>
<tr>
<td>3.6</td>
<td>MagnetPass.findLowerBounds()</td>
<td>43</td>
</tr>
<tr>
<td>3.7</td>
<td>MagnetPass.findUpperBounds()</td>
<td>47</td>
</tr>
<tr>
<td>3.8</td>
<td>The ClusterPoint Class</td>
<td>48</td>
</tr>
<tr>
<td>3.9</td>
<td>MagnetPass.getBestRangeOverlap()</td>
<td>48</td>
</tr>
<tr>
<td>3.10</td>
<td>MagnetPass.gatherAtBestRangeOverlap()</td>
<td>53</td>
</tr>
<tr>
<td>4.1</td>
<td>Relevant portion of the LLVM source tree</td>
<td>56</td>
</tr>
<tr>
<td>4.2</td>
<td>ARMLoadStoreOptimizer.cpp class hierarchy</td>
<td>60</td>
</tr>
<tr>
<td>4.3</td>
<td>Block Before Moves</td>
<td>68</td>
</tr>
<tr>
<td>4.4</td>
<td>Block After Moves</td>
<td>69</td>
</tr>
<tr>
<td>5.1</td>
<td>Dijkstra ASM Movements</td>
<td>76</td>
</tr>
<tr>
<td>5.2</td>
<td>Qsort Pre ASM Movements</td>
<td>78</td>
</tr>
<tr>
<td>5.3</td>
<td>Qsort Post ASM Movements</td>
<td>79</td>
</tr>
<tr>
<td>5.4</td>
<td>Sha ASM Movements</td>
<td>84</td>
</tr>
<tr>
<td>5.5</td>
<td>bmhasrch ASM Movements</td>
<td>90</td>
</tr>
<tr>
<td>C.1</td>
<td>Optimization Code</td>
<td>125</td>
</tr>
<tr>
<td>C.2</td>
<td>Enabling Optimization</td>
<td>167</td>
</tr>
<tr>
<td>C.3</td>
<td>prototype.py</td>
<td>169</td>
</tr>
<tr>
<td>C.4</td>
<td>instr_ops.py</td>
<td>177</td>
</tr>
</tbody>
</table>
Chapter 1

Introduction

1.1 Embedded Computing Challenges

While general purpose microprocessors have historically focused on maximizing instruction throughput, the field of embedded computing includes the additional constraints of minimizing power dissipation and die size while still achieving acceptable performance for its desired applications. An embedded processor may often use power from a limited battery, and must thus conserve power used both by the processor and by RAM.

Many compiler techniques used for more general purpose computing can be adapted to reduce power dissipation. For example, reducing the number of loads and stores (which are instructions with high latency) can decrease execution time, which is important for general purpose machines in increasing throughput. This same technique can also be used by embedded processors to reduce the power dissipated. Thus, some optimizations are platform agnostic in that they create desirable effects for both general purpose computers and embedded computers.

Embedded computers are often designed for mass production. When many millions of units will be sold, there is more motivation for a company to increase up-front development cost to reduce the manufacturing cost of the final product. Even a small reduction in manufacturing costs can lead to substantial savings over the life of a successful product. For most SoC (System on Chip) applications, the largest component present on the die is the memory [20]. All instructions required for these embedded systems to perform their functions must be present in some form of ROM on the SoC. It is very desirable to reduce the physical size of the code stored in the ROM: with a smaller code size for the same application, a smaller ROM may be used, reducing the overall cost of the embedded system without reducing its feature set. Various techniques have been used in the history of computing for data and code compression; a good survey of compression techniques used for code size reduction was performed in [10].

This thesis will examine one such technique specific to ARM processors to reduce code size without impacting code functionality. In the rest of this chapter, we will examine some of the background necessary to perform this technique, which reduces individual memory instructions by combining them into block memory instructions, yielding reduced code size without affecting the functionality of the program. This is one such method among many that should be considered by companies building embedded systems to reduce their manufacturing costs by reducing ROM sizes.
1.2 ISAs

The most basic technique for reducing code size is to reduce the number of instructions used for a given program. The vocabulary of instructions used by a microprocessor to execute programs is called the Instruction Set Architecture (ISA) [29]. A processor’s hardware is designed to accept a certain range of instructions encoded in machine code using a particular format. Machine code is simply a certain number of sequential bits which, when taken together, cause the processor to perform a particular instruction. The various types of instructions are called operations, and each operation may use one or more operands. An operand is supplemental information provided to the operation, which may include registers or constant values. Each operation has its own opcode, which is the bit pattern with which the operation is encoded in binary. Opcodes are combined with operands in certain well-defined formats to form a complete operation. Most processors use a fixed-width ISA, meaning that each operation takes up the same amount of memory in the instruction cache. At present, most embedded processors use 32-bit wide instructions, though smaller instruction widths such as 16 bits are also in use for certain applications.

During execution, a processor will fetch an instruction, break it into its constituent parts (opcode, operands, etc.) and then execute it. Generally, the operands for an instruction are registers, but operands may also include constant values, such as in the case of using an offset from a particular base register. Instructions may provide various functions, including arithmetic operations, memory access operations, and conditional logic operations. All high-level language functionality must be reduced by the compiler down to a series of ISA instructions. While some functionality may be relatively simple to emulate (such as a=b+c), others may require chains of low-level instructions to replicate the functionality of the high-level language, such as function calls.

For the more common fixed-width ISAs, the design of the ISA is severely restricted by the fact that every instruction must be contained within a fixed number of bits. Thus, there is a tradeoff between the number of opcodes, the number of registers used as operands, and the register space available to use for an operation. Some processors use a large number of registers, but can only support a small number of operation types. Others are restricted to using only two operands per instruction, which limits instruction flexibility, but provides more available bits for use as opcodes or to index registers. However, for a given number of instructions, a fixed-width ISA will always use the same number of bits regardless of what the operations executed by the processor actually do.
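The tradeoff described above can be made concrete with a little arithmetic. The sketch below assumes a hypothetical instruction format (an opcode field, some register operands, and whatever bits remain for a constant); the function name `immediate_bits` and the field sizes are illustrative only and do not describe the actual ARM encoding.

```python
def immediate_bits(width, num_opcodes, num_registers, num_reg_operands):
    """Bits left for a constant field in one fixed-width instruction.

    Hypothetical format: [opcode | register operands | immediate].
    """
    opcode_bits = (num_opcodes - 1).bit_length()   # bits needed to number every opcode
    reg_bits = (num_registers - 1).bit_length()    # bits needed to index one register
    remaining = width - opcode_bits - num_reg_operands * reg_bits
    if remaining < 0:
        raise ValueError("fields do not fit in the instruction width")
    return remaining


# 32-bit instructions, 64 opcodes, 16 registers, three register operands.
print(immediate_bits(32, 64, 16, 3))  # 32 - 6 - 3*4 = 14
# Doubling the register file costs one extra bit per register operand.
print(immediate_bits(32, 64, 32, 3))  # 32 - 6 - 3*5 = 11
```

Under these assumed numbers, growing the register file from 16 to 32 registers removes three bits from the constant field, which is the kind of pressure that shapes a fixed-width ISA.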

1.3 Significance of Memory Operations

Microprocessors store values in a number of locations. A processor will typically use a number of registers to store values that the processor is currently working with, but register banks are typically small. At some point the processor will need to store a register value elsewhere so it can use the register for another value. Processors will store the value in main memory, often called RAM, which can hold anywhere from several thousand to several billion memory values (often called words), depending on the design and application. Once a value is stored, the processor can load it back from main memory into a register and work with it.

Because registers are part of the processor itself, they can typically be accessed in a single clock cycle. By contrast, reading from or writing to main memory can take hundreds of clock cycles, during which the processor typically remains idle, generating no-ops (instructions that produce no useful output, which are used to stall a processor core) while it waits for the memory to access the value and return it via the memory bus. Many processor designs utilize caches to reduce this delay by storing recently accessed (or predicted potential) values in a small region of memory located very near the processor. Such caches can return values within ones or tens of clock cycles, significantly reducing the penalty for memory accesses. Memory access times are unpredictable, but are nearly always longer than a register access. Hence, the more instructions a compiler generates that use registers rather than memory accesses, the faster the code will typically execute. Many compilers use a number of techniques to reduce memory accesses, allowing the generated code to run faster and use less power [9].

1.4 Target Platform: ARM Cortex-A8 Microprocessor

This thesis will be focused on optimizing code size for a specific ISA and microprocessor: the ARM Cortex-A8.

The ARM Cortex-A8 is a RISC-based microprocessor designed to implement the ARM ISA and the Thumb-2 ISA [4]. It contains 16 registers and a two-level cache system with configurable sizing. The Cortex-A8 uses dynamic branch prediction to guess at which code path will be taken based on previous execution paths. It also contains a memory management unit along with instruction and data translation look-aside buffers to cache virtual address translations and thus reduce memory access latency. The Cortex-A8 has been used in several popular products, including mobile phones such as the Motorola Droid [7] and Palm Pre [1]. It is a popular choice for embedded systems and mobile devices.

In this thesis, we will use a Cortex-A8 processor mounted on a BeagleBoard development board [3]. The processor runs at 720 MHz and is contained in a Texas Instruments OMAP3530 chip. The board contains 2 Gb of SDRAM, as well as several expansion connectors, such as USB 2.0, S-Video, and stereo audio ports.

1.5 The Role of Compilers

Few programs are directly developed in assembly code using a specific target ISA’s instructions. In practice, programs are written in a high-level language and later converted into assembly code. A compiler is a tool used to translate a high-level language expressed in plain text into a low-level assembly language representation that is specific to a target microprocessor. Compilers are usually used in conjunction with assemblers, which take the generated assembly code and translate it into the machine code that is actually executed by the microprocessor [9].

Compilers use several transformative phases which are executed when compiling a program. These phases are divided into two sections: the front end, which performs analysis of the program, and the back end, which uses this analysis to perform optimization and synthesis of the program to transform it into a target language. The front end takes as input a character encoded program text, performs lexical, syntactic, and semantic analysis of the text, and finally outputs an intermediate representation, which is usually a separate internal language defined by the compiler, to the back end of the compiler. This back end performs machine-independent optimizations to the intermediate representation, generates target-dependent assembly code, performs machine-dependent optimizations upon this assembly language representation, and finally outputs it to the assembler for final compaction.

This thesis will analyze a specific optimization pass within the machine-dependent optimization phase of the compiler, and thus we will not concern ourselves with the front end of the compiler at all. We will assume that the intermediate representation language has already been translated to the target-dependent assembly language, and will focus entirely on optimizing the memory instructions encoded in assembly code.

1.6 Ordering Optimizations

As mentioned above, a compiler goes through several phases as it transforms a high level language to an ISA representation. These phases each contain optimizations, which are responsible for reducing instruction count (and thus code size) without affecting program correctness. Generally these optimizations are ordered based on the level of information needed by an optimization. An example of ordering optimizations can be found in the code generator, which is part of the back end of the compiler. Within the code generator, some optimizations are constrained to occur before or after the register allocation process. Register allocation is the process by which some large set of theoretic registers used by the internal compiler representation is mapped to the fixed, and generally smaller, set of registers present in the actual hardware. Some optimizations may be performed either before or after register allocation. For such an optimization, research must be done to decide where it is best scheduled, based on its performance when placed before or after register allocation. This thesis will examine the relative performance of an optimization, and its impact on code size, when implemented entirely after register allocation, compared with an existing implementation requiring phases to run both before and after register allocation.

1.7 Target Compiler: LLVM

The LLVM project is a compiler framework that provides high-level information to compiler transformations [25]. LLVM originally was a university project at the University of Illinois, but was later released with a BSD-style open source license. The framework is implemented in C++, with documentation available on the project website [5].

LLVM uses Static Single Assignment (SSA) form as its internal low-level program representation. In SSA form, there are an infinite number of write-once registers that may be read multiple times. Thus, a register in SSA form can be used in multiple calculations, but can only be defined once. SSA is used in the internal representation of many compilers due to the ease of applying certain types of optimizations.
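To illustrate the write-once property, the following minimal sketch renames a straight-line sequence of assignments so that each name is defined exactly once. It is an illustration only, not LLVM's implementation: the function `to_ssa` and the `(target, expression)` statement format are assumptions made here, and control flow (which would require phi functions) is deliberately left out.

```python
import re


def to_ssa(statements):
    """Rename straight-line assignments so every name is defined once.

    Each statement is a (target, expression) pair. Names that are never
    assigned are treated as inputs and left unchanged.
    """
    version = {}  # variable name -> number of its latest definition
    result = []
    for target, expr in statements:
        def rename(match):
            name = match.group(0)
            # A use refers to the most recent definition of that variable.
            return f"{name}{version[name]}" if name in version else name

        new_expr = re.sub(r"[A-Za-z_]\w*", rename, expr)
        # Each definition creates a fresh, write-once register.
        version[target] = version.get(target, 0) + 1
        result.append((f"{target}{version[target]}", new_expr))
    return result


program = [("a", "b + c"), ("a", "a + 1"), ("d", "a * a")]
for target, expr in to_ssa(program):
    print(f"{target} = {expr}")
# a1 = b + c
# a2 = a1 + 1
# d1 = a2 * a2
```

The variable `a` is written twice in the input, so the sketch produces two registers, `a1` and `a2`; each is defined once but may be read any number of times.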

1.8 Register Allocation During Compilation

During the compilation process, variables are initially not tied to any particular registers. During the front end of the compilation, the source code is analyzed and converted into an intermediate representation (IR). This representation is then analyzed for machine-independent optimizations such as the elimination of dead (unreachable) code. Eventually, the compiler must perform instruction lowering, which is a process by which the IR is converted to a target-specific representation, such as an assembly language. During this process, which is part of the code generation phase, register allocation must occur. Register allocation is the process of replacing variable names with actual registers, and determining when to store a register to main memory to allow the register to be reused for another variable, a process called register spilling.

1.9 Memory Optimization

As the instructions are lowered to the target representation, machine-dependent optimizations can be performed on the new low-level representation. Such operations involve target-specific techniques that can reduce the program’s code size without impacting code correctness. Many of these optimizations involve memory, and the arrangement of memory operations. These optimizations can occur before or after register allocation, depending on whether the number and locations of the physical registers affect the optimization.

The ARM ISA, which we will be using as the output for our code generator, is a load/store architecture [6], meaning all values must be loaded into registers in order to be operated on. This is in contrast to a CISC architecture like x86, which is a register-memory architecture and thus allows operations to affect values directly in memory, not just values contained within registers. The ARM ISA contains a simple store operation abbreviated STR. This STR operation saves the contents of a particular register to a specified location in main memory. The STR operation involves only a single register and a single memory location; multiple stores require the use of multiple STR instructions. Later revisions of the ARM ISA included a multiple store operation abbreviated STM. This STM instruction stores several registers to contiguous memory locations in fewer clock cycles [2, 8] and in smaller code size than the associated number of individual STR operations. Thus, a target-specific optimization performed by many compilers is to attempt to find adjacent STR operations which reference contiguous locations in main memory, and convert the STR instructions into a single STM instruction. This reduces code size and increases instruction throughput by reducing the number of processor cycles required to perform the same memory accesses. There also exist analogous instructions for loading values from main memory (LDR and LDM), which allow memory accesses to be optimized regardless of the direction of data transfer.
In this thesis, we will be examining these multiple memory instructions, and attempt to improve the compiler’s conversion of single memory instructions to multiple memory instructions.
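The conversion described above can be sketched in a few lines. This is a simplified model written for illustration, not the LLVM implementation: the tuple format `("STR", reg, base, offset)`, the function `combine_stores`, and the output tuple `("STM", base, start_offset, regs)` are all assumptions made here. The sketch also ignores the addressing-mode restrictions of the real STM instruction and only requires that register numbers rise with the addresses.

```python
WORD = 4  # bytes per ARM-mode instruction and per stored word


def combine_stores(block):
    """Merge runs of adjacent STRs to contiguous addresses into one STM.

    A run is extended only when the base register matches, the offset
    rises by one word, and the register number rises as well.
    """
    out, run = [], []

    def flush():
        # Two or more stores become one STM; a lone store stays an STR.
        if len(run) >= 2:
            out.append(("STM", run[0][2], run[0][3], [ins[1] for ins in run]))
        else:
            out.extend(run)
        run.clear()

    for ins in block:
        extends = (
            ins[0] == "STR" and run
            and ins[2] == run[-1][2]          # same base register
            and ins[3] == run[-1][3] + WORD   # next contiguous word
            and ins[1] > run[-1][1]           # ascending register number
        )
        if not extends:
            flush()
        if ins[0] == "STR":
            run.append(ins)
        else:
            out.append(ins)
    flush()
    return out


block = [
    ("STR", 0, "sp", 0),
    ("STR", 1, "sp", 4),
    ("STR", 2, "sp", 8),
    ("ADD", 3, 3, 1),
    ("STR", 4, "sp", 16),
]
merged = combine_stores(block)
print(merged)
print(len(block) * WORD, "bytes ->", len(merged) * WORD, "bytes")  # 20 bytes -> 12 bytes
```

Note that the final store is left alone because the intervening ADD breaks the run. Only stores that are already adjacent can be merged, which is why a separate step that moves eligible memory operations next to each other matters.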

1.10 Memory Optimization in LLVM

The LLVM compiler executes multiple optimization passes to reduce code size and increase performance. It uses passes on both the intermediate representation (IR) code and the machine-dependent assembly code created during the code generation phase of compilation. We will focus the attention of this thesis on one particular machine-dependent optimization performed by LLVM: the Load/Store optimizer.

The LLVM Load/Store optimizer is contained within a class called ARMLoadStoreOpt within the file ARMLoadStoreOptimizer.cpp. This file contains several classes which work together in a two part process to perform a single optimization. These classes operate on basic blocks, which are sequences of code that do not contain jumps or branches, and thus have only a single entry point and a single exit point.
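As a rough illustration of the basic block definition given above, the sketch below splits a list of assembly lines into blocks. It assumes a simplified textual form in which a label line ends with a colon and starts a new block, and a branch ends the current block; the `BRANCHES` set is a small illustrative subset of ARM branch mnemonics, and `split_basic_blocks` is a hypothetical helper, not part of LLVM.

```python
# Illustrative subset of branch mnemonics; a real list would be longer.
BRANCHES = {"b", "bx", "beq", "bne", "blt", "ble", "bgt", "bge"}


def split_basic_blocks(lines):
    """Group assembly lines into blocks with one entry and one exit."""
    blocks, current = [], []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        # A label is a possible jump target, so it must begin a block.
        if text.endswith(":") and current:
            blocks.append(current)
            current = []
        current.append(text)
        # A branch transfers control, so it must end a block.
        if text.split()[0].lower() in BRANCHES:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


code = [
    "ldr r0, [r1]",
    "cmp r0, #0",
    "beq done",
    "add r0, r0, #1",
    "str r0, [r1]",
    "done:",
    "bx lr",
]
for block in split_basic_blocks(code):
    print(block)
```

For this input the sketch yields three blocks: the first ends at `beq done`, the second ends just before the `done:` label, and the third holds the label and the return.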

The two parts of the Load/Store optimizer are shown in Figure 1.1. The first part of this optimization occurs before register allocation occurs. At this point, memory instructions within a basic block that share a base register are identified and moved to be contiguous, if this can be safely accomplished.

The second part of this optimization occurs after register allocation is performed.
The central member function within the ARMLoadStoreOpt class is the LoadStoreMultiOpt function, which iterates through basic blocks searching for contiguous memory operations (i.e., loads or stores). The member function attempts to combine these contiguous operations into a single multiple-memory operation (i.e., a LDM or a STM) by using base register offsets to correctly order the registers to be loaded or stored.

This thesis will compare and contrast this existing implementation with a newly developed, post-register allocation pass which will replace the analysis performed by the pre-register allocation phase of this optimization. Rather than replacing or modifying the LoadStoreMultiOpt function, we will design a pass that will run on the assembly code directly before the LoadStoreMultiOpt pass executes. The purpose of this pass will be to assist the LoadStoreMultiOpt pass by moving memory operations into advantageous positions within individual basic blocks without violating any dependencies. The memory operations will be moved so that they are contiguous if possible, and will thus be optimized into multiple-memory instructions in the subsequent LoadStoreMultiOpt pass.
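The phrase "without violating any dependencies" can be illustrated with a small sketch. This is not the MagnetPass algorithm, which later chapters describe; it only shows the kind of check any such mover must make. The `Instr` record, the `conflicts` rule, and the `hoist` helper are assumptions introduced here, and the memory rule is deliberately conservative because the sketch performs no alias analysis.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    text: str
    defs: frozenset = frozenset()  # registers written
    uses: frozenset = frozenset()  # registers read
    mem: str = ""                  # "", "load", or "store"


def conflicts(a, b):
    """True if a and b cannot be safely reordered."""
    # Register dependencies: read-after-write, write-after-read, write-after-write.
    if a.defs & b.uses or a.uses & b.defs or a.defs & b.defs:
        return True
    # Without alias analysis, two memory operations may swap only if both are loads.
    if a.mem and b.mem and "store" in (a.mem, b.mem):
        return True
    return False


def hoist(block, src, dst):
    """Return a copy of block with block[src] moved up to index dst, or None if unsafe."""
    moved = block[src]
    if any(conflicts(moved, other) for other in block[dst:src]):
        return None
    return block[:dst] + [moved] + block[dst:src] + block[src + 1:]


block = [
    Instr("ldr r0, [r2]", defs={"r0"}, uses={"r2"}, mem="load"),
    Instr("add r4, r4, #1", defs={"r4"}, uses={"r4"}),
    Instr("ldr r1, [r2, #4]", defs={"r1"}, uses={"r2"}, mem="load"),
]
moved = hoist(block, 2, 1)
print([i.text for i in moved])
# ['ldr r0, [r2]', 'ldr r1, [r2, #4]', 'add r4, r4, #1']
```

After the move, the two loads share a base register, reference contiguous words, and sit next to each other, so a later combining pass could consider them for a single LDM. Had the intervening instruction written `r2`, `hoist` would return `None` and the block would be left unchanged.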

1.11 Previous Work

Compiler optimization is a prolific area for computer science research, and many such research efforts choose to focus, directly or indirectly, on reducing energy expended during runtime. In the case of direct focus, a particular research effort may explicitly give its results in terms of power dissipation. However, an optimization may indirectly reduce power by reducing the dynamic instruction count, which is most often the metric given by researchers focusing on performance increases. In fact, code generators which focus on reducing power and code generators which focus on reducing execution cycles produce very similar assembly code [35].

Another way an optimization may save power is by reducing overall code size, enabling a smaller ROM chip to be used for a particular embedded application. While the block memory instructions we are examining in this thesis are an example of reducing code size, [26] examines compressing instruction memory through the use of a lookup table, leading to substantial savings in program memory requirements. In [19], heuristics are introduced that improve code compression ratios using partial matching. The compression lookup tables can themselves be compressed, further increasing overall code density, as seen in [11]. Apart from compressing instructions, [17] introduces a link time optimization for the ARM platform which reduces the amount of library object code included in embedded contexts at link time.

As mentioned above, an optimization may focus on reducing dynamic instruction count, which often can reduce power losses by reducing overall execution time. Two related techniques using post-register allocation optimizations were proposed in [16] and later extended in [27]. The selection of which optimizations to use is non-trivial, as optimizations may interfere with each other and reduce overall performance. Using profiling in conjunction with performance counters to determine which combinations of optimizations maximize performance for specific workloads was examined in [13].

An example of a research project providing an overview of some low energy compilation techniques is [35]. In this survey of techniques, the authors examine reordering instructions to minimize switching power by reducing the number of bit flips between neighboring instructions. This work was applied to VLIW mobile devices in [34] and later extended in [14], which formulated heuristics to reduce switching on the instruction bus. The authors of [35] also consider optimizations which reduce memory instructions by improving global register allocation, a topic we will examine more closely below.

Using dynamic power dissipation from bit switching is one way to model power costs, but [36] constructs an instruction level model of overall power loss based on the number of cycles and the overall current required by each CPU instruction. This allows for a much more exact estimate of a program’s overall power dissipation based on the sum of the individual power requirements for each instruction. This work is the basis for examining the utility of many other optimization techniques, such as [18].

Because of the substantial costs associated with memory accesses (primarily loads and stores), much research has been performed in reducing the number of memory accesses through elimination of redundant code. Some have focused on static code size reduction using more efficient register allocators or other techniques. Two well studied register allocation techniques include graph coloring in [12, 15, 23] and the linear scan technique in [31, 37, 38].

This thesis is concerned with compiler techniques to speed up general purpose register loads and stores by using block memory instructions to move multiple values using one operation. There are other architectural techniques which have been developed to reduce latency when performing the same operation across multiple values. The most famous of these is likely Intel’s MMX extension to its x86 architecture [30]. These extensions, which were introduced with the Pentium MMX, allowed for multiple packed integer values to be moved between memory and a special bank of MMX registers, which allowed vector operations to be performed on multiple values at once. MMX is primarily used for media playback or encoding, and was later extended by Intel’s SSE extensions [32] to work with larger packed values and floating point operations. These operations are known as streaming operations and are a type of SIMD (single instruction, multiple data) operation. The ARM architecture also allows for SIMD using VFP (Vector Floating Point) instructions [4]. These instructions were not true vector operations, as they operated in sequence rather than in parallel, and they were later supplanted by ARM’s NEON media coprocessor. Like Intel’s SSE, NEON performs vector operations in parallel across many values (known as scalars) simultaneously [4]. Unlike Intel’s x86, due to ARM’s load/store architecture it must load all such values into separate NEON registers before performing operations on the values. While vector operations are an interesting sub-field of computer architecture, this thesis focuses on improving block memory instructions involving general purpose registers, and we will thus not consider the NEON instructions further. Research in [21, 24] examines applying general purpose workloads to streaming SIMD instructions to achieve better parallelism in multi-core environments. Performing SIMD using interleaved data, rather than packed data, is shown to yield promising speedups in [28]. While most use of vector operations involves the use of assembly code or special libraries, [33] examines improving compiler support for the use of multimedia vector operations in general purpose workloads, along with the current challenges in this field.

1.12 Contribution of This Thesis

This thesis presents a method of optimizing memory operations performed by LLVM by heuristically selecting non-contiguous related memory operations and moving them to contiguous locations. This pass will be performed entirely after register allocation has occurred, in contrast to the current, similar optimization performed by LLVM, which requires a mix of both pre- and post-register allocation information and actions. This new pass, which we call MagnetPass, will work in conjunction with the existing post-register allocation LoadStoreMultOpt pass, and its effectiveness will be compared against the existing memory optimization pass, which we will examine briefly in the next chapter. This new pass will be target-specific to ARM processors which use ISAs supporting multiple memory operations. This pass, in a way similar to the existing memory operation optimization, will reduce code size, increase instruction throughput, and reduce power consumption simultaneously.

1.13 Overview of Thesis

The organization of the rest of this thesis is as follows. We will begin by examining the theory necessary to ensure program correctness in Chapter 2. Next, in Chapter 3, we will delve into the algorithmic constructs necessary to support the theory, and determine a method which will be used to perform the preliminary analysis. We will also examine the heuristic used to determine where instructions should be moved to. In Chapter 4, we will explain the implementation details and specify our class structures. Chapter 5 will contain a summary of our results for particular benchmarks, and a comparison with the existing optimization. Our work will be summarized in Chapter 6, followed by Appendix A which will give further information about the benchmarks selected for use in this thesis. Appendix B will contain the full dataset from our test results. Finally, Appendix C will contain the new source code integrated into the LLVM compiler.
Chapter 2

Theory

This chapter provides an overview of what is required to implement a post-register allocation version of the LLVM memory instruction optimization ARMLoadStoreOpt (our MagnetPass). First, we will examine memory instructions, and the reasons to cluster them. We will then briefly discuss the differences between loads and stores prior to our examination of how the individual operations can be safely moved in machine-level code. Finally, we will look at where the instructions should be moved to once we determine their range of movements.

2.1 The Importance of Clustering

As mentioned in Section 1.10, the LLVM compiler has a post-register allocation pass that combines contiguous compatible memory instructions into a single multiple memory instruction. This pass is preceded by a pre-register allocation pass which examines the memory operations and attempts to move them together. Figure 2.1 illustrates this sequence of optimization passes. It also shows that there are many optimizations within the code generator, some of which occur before register allocation and some of which occur afterwards.

**Figure 2.1: Existing pass architecture**

![Diagram of existing pass architecture]

This thesis presents a post-register allocation pass called MagnetPass that runs just before ARMLoadStoreOpt, which will move similar memory instructions into clusters (contiguous groups of instructions) without the need for a separate pre-register allocation pass. These clusters are then optimized into multiple memory instructions by the subsequent ARMLoadStoreOpt pass. This new sequence is shown in Figure 2.2.

**Figure 2.2: New pass architecture**

We wish to compare the efficiency of implementing this clustering phase after register allocation against the existing pre-register allocation implementation. While it is possible to run both optimizations in the same compilation, doing so is redundant, as both passes should yield similar clusters of memory instructions.

### 2.2 Ranges of Movement

Even after instruction scheduling has occurred, it is still possible to reorder program instructions so long as we maintain program correctness. This is done by analyzing instruction dependencies on surrounding operations, and ensuring that we do not violate any of these dependencies. In particular, we are interested in moving memory instructions, and defining the dependencies governing their movements. We will refer to the space within which we can safely reorder the memory instructions as the instructions’ range of movement. Because loads and stores are usually the definition and final use (respectively) of a variable, moving these instructions affects the live ranges of these variables. We must therefore ensure that our ranges include information that prevents any unsafe movement of memory instructions, particularly as they relate to the definitions and final uses of variables. We will define these criteria more precisely in the upcoming sections.

Note that because our optimization will occur after register allocation, we do not need to worry about later optimizations adding spill code due to the rescheduling performed by our optimization.
2.3 Looking at Loads and Stores

While loads and stores are both classified as memory operations, and are thus closely related, they will be examined individually to allow for differences in their dependency constraints. An example of such a constraint is that before a value can be safely loaded, it must have been stored to the proper memory location. A store has no such constraint; rather, we must ensure that before a value is stored, the destination location has no pending loads of its previous value.

2.3.1 Range of Movement for Loads

A load operation pulls a value from memory into a register. While there are many ways to address the memory location, we will first look at the register receiving the loaded value, which we shall denote $R_x$. $R_x$ may have been used in the past for some other value; if this is the case, by the time we encounter the load into $R_x$, this previous value may have been stored to memory. Alternatively, $R_x$ may have been used for some temporary variable which was not finally stored, or this may be the first time $R_x$ is used in this basic block. In any case, we must be sure we do not move the instruction before the last use of $R_x$, or we will violate a write-after-read (WAR) dependency. In addition to these considerations, we must ensure that the base register used to index the load is unchanged by our instruction moving. Thus, if we wish to move the load operation, we cannot move the instruction before the point where the base register for the load’s memory address is defined, or we will violate a read-after-write (RAW) data dependency. Finally, we must also take care that we do not move the load instruction before a point where the value we are loading is changed via another store to the address we are loading from.

We therefore see that in order to prevent overwriting of an important value contained within $R_x$, an instruction to load $R_x$ has an upper (earliest) range of movement bounded by the nearest of the following:

1. The beginning of the basic block
2. The last use of the previous value contained in $R_x$, including stores
3. The last write to the base register used by the load
4. The last write to the memory address used by the load

Note that because we will be examining instructions at the basic block level, we will consider range of movement to be bounded by the limits of the basic block, if no other limits present themselves earlier. We perform this optimization at the basic block level to match how the existing LLVM optimization, which is a local (bounded by the basic block) optimization, iterates through the instructions. Matching the optimization boundaries, as we do here, will allow for a better comparison between the two implementations.

Alternatively, we may wish to delay the load instruction. The requirements are quite similar, though rather than noting the last use of the value contained in $R_x$, we must take care not to move the load later than the first use of the newly loaded $R_x$ register (a RAW dependency). Thus, we may move a single load instruction down, but not past the nearest of the following:

1. The end of the basic block
2. The first use of the new value to be loaded into $R_x$
3. The first write to the base register used by the load
4. The first write to the memory address used by the load

We may move the single load anywhere within the range bordered by these two boundaries.

A simple example we shall examine for load operations is the basic block fragment in Listing 2.1.

### Listing 2.1: Basic Load Example

<table>
<thead>
<tr>
<th>Line</th>
<th>Instruction</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>LDR r0, [sp, #4]</td>
</tr>
<tr>
<td>2</td>
<td>ADD r0, r0, r5</td>
</tr>
<tr>
<td>3</td>
<td>LDR r1, [sp, #8]</td>
</tr>
</tbody>
</table>

We wish to combine the LDR instructions as efficiently as possible. The instructions will not be combined unless we move them to be contiguous, as they are currently separated by non-memory instructions. We determine using our above criteria that the LDR r0 instruction cannot be moved to be later than the first use of r0 (the subsequent ADD instruction). Thus, we cannot move the LDR r0 instruction at all. However, we note that the LDR r1 instruction has a range of movement bounded by the top of the basic block. To see this, note that r1 is not used previously in this basic block, no write is performed to the stack pointer, and no stores are performed on memory location sp + 8. We can therefore move the LDR r1 instruction up prior to the ADD instruction and make the two loads contiguous, allowing them to be optimized by the subsequent LoadStoreMultOpt pass.
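
The bounds listed above for loads can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MagnetPass implementation: the `Instr` record and the `blocks_load` and `load_range` helpers are hypothetical names, two accesses are treated as the same address only when base register and offset match exactly (pointer aliasing is ignored), and any write to $R_x$ is treated as a barrier in addition to its uses.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    """Hypothetical, simplified view of one machine instruction."""
    kind: str            # "load", "store", or "alu"
    reads: frozenset     # registers read
    writes: frozenset    # registers written
    base: str = ""       # base register (memory ops only)
    offset: int = 0      # immediate offset (memory ops only)

def blocks_load(load, other):
    """True if the load may not be moved across `other`."""
    (rx,) = load.writes
    same_addr = (other.kind == "store"
                 and (other.base, other.offset) == (load.base, load.offset))
    return (rx in other.reads or rx in other.writes   # old or new value of Rx
            or load.base in other.writes              # base register redefined
            or same_addr)                             # memory value changed

def load_range(block, i):
    """Return (earliest, latest) positions the load at block[i] may occupy."""
    load = block[i]
    earliest = i
    while earliest > 0 and not blocks_load(load, block[earliest - 1]):
        earliest -= 1
    latest = i
    while latest < len(block) - 1 and not blocks_load(load, block[latest + 1]):
        latest += 1
    return earliest, latest

# The fragment from Listing 2.1.
block = [
    Instr("load", frozenset({"sp"}), frozenset({"r0"}), "sp", 4),  # LDR r0, [sp, #4]
    Instr("alu", frozenset({"r0", "r5"}), frozenset({"r0"})),      # ADD r0, r0, r5
    Instr("load", frozenset({"sp"}), frozenset({"r1"}), "sp", 8),  # LDR r1, [sp, #8]
]
print(load_range(block, 0))
print(load_range(block, 2))
```

Under these assumptions the first load is pinned to its current position, while the second may occupy any position from the top of the fragment down to where it already sits, so placing it directly after the first load makes the two contiguous.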

### 2.3.2 Range of Movement for Stores

A store operation is the inverse of a load operation. Rather than pulling a value from memory into a register, a store copies a value from a register to main memory. As before, we will assume that the register in question is denoted $R_x$. When moving stores upwards, we must ensure that we do not move the store to be prior to the last write into $R_x$; moving past this point would cause us to store a register value that has not been fully computed (a RAW dependency). In addition, we cannot move the store before the point where a load instruction loads a value from the memory location used by the store, lest we erase the previous value in the memory location that the load meant to read. We must also ensure that the base register used to index the store is unchanged by our instruction moving; thus, we cannot move the instruction above the point where the base register is modified (another RAW dependency). Finally, there is an additional subtle dependency in the form of other stores. If a store using one base+offset moves prior to a store using another base+offset which happens to be an alias of the first base+offset, we could violate a write-after-write (WAW) dependency. Thus, we must make sure we do not reorder stores to the same memory address.

We see that in order to ensure correctness, an instruction to store $R_x$ has an upper range of movement bounded by the nearest of the following:

1. The beginning of the basic block
2. The last modification of the value contained in $R_x$, including loads
3. The last write to the base register used by the store
4. The last load from the memory address used by the store
5. The last store to the memory address used by the store

Alternatively, we may wish to delay the store instruction. The requirements are quite similar, though rather than noting the last modification of the value contained in $R_x$, we must take care not to move the store later than the first modification of the value to store within the $R_x$ register (a WAR dependency). Thus, we may move a single store instruction down, but not past the nearest of the following:

1. The end of the basic block
2. The first modification of the value in $R_x$
3. The first write to the base register used by the store
4. The first load from the memory address used by the store
5. The first store to the memory address used by the store

We may move the single store anywhere within the range bounded by these two sets of options.

As before, we will look at a simple example for store operations contained within the basic block fragment in Listing 2.2.
We wish to combine the STR instructions as efficiently as possible. The instructions must therefore move to be contiguous, if possible. We determine using our above criteria that the STR r0 instruction cannot be moved up prior to the last modification of r0 (the prior ADD instruction). Thus, we cannot move the STR r0 instruction at all. However, we note that the STR r1 instruction has a range of movement bounded by the end of the basic block; as r1 is not modified later in this basic block, no write is performed to the stack pointer, and no loads are performed on memory location sp + 4. We can therefore move the STR r1 instruction down below the ADD instruction and make the two stores contiguous, allowing them to be optimized by the subsequent LoadStoreMultOpt pass.
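
The five store bounds can be sketched in the same spirit. The fragment below is hypothetical: it is chosen only to match the description above (a store of r1 to sp + 4, an ADD that modifies r0, and a store of r0), and the exact operands, in particular the offset of the second store, are assumptions. The `Instr` tuple and the helper names are likewise illustrative, and addresses are compared by exact base and offset with no alias analysis.

```python
from collections import namedtuple

# Hypothetical, simplified instruction record (not an LLVM type).
Instr = namedtuple("Instr", "kind reads writes reg base offset",
                   defaults=(None, None, 0))

def blocks_store(store, other):
    """True if the store may not be moved across `other`."""
    same_addr = (other.kind in ("load", "store")
                 and (other.base, other.offset) == (store.base, store.offset))
    return (store.reg in other.writes      # value in Rx is modified
            or store.base in other.writes  # base register is modified
            or same_addr)                  # load from or store to the same address

def store_range(block, i):
    """Return (earliest, latest) positions the store at block[i] may occupy."""
    store = block[i]
    earliest = i
    while earliest > 0 and not blocks_store(store, block[earliest - 1]):
        earliest -= 1
    latest = i
    while latest < len(block) - 1 and not blocks_store(store, block[latest + 1]):
        latest += 1
    return earliest, latest

# Assumed fragment: a store, an unrelated ADD, then a store of the ADD result.
block = [
    Instr("store", {"r1", "sp"}, set(), "r1", "sp", 4),  # STR r1, [sp, #4]
    Instr("alu", {"r0", "r5"}, {"r0"}),                  # ADD r0, r0, r5
    Instr("store", {"r0", "sp"}, set(), "r0", "sp", 8),  # STR r0, [sp, #8]
]
print(store_range(block, 0))
print(store_range(block, 2))
```

With this fragment the store of r0 cannot move above the ADD, while the store of r1 can be delayed to the end of the fragment, which is the movement described above.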

### 2.4 Selection of Cluster Points

Once we have determined the ranges of motion for each of the memory instructions in a basic block, we must decide where to move the individual instructions in order to maximize the number of operations combined into a multiple memory operation. We do this by partitioning the basic block into sets of operations that have memory operations using the same base pointer. This ensures that we do not attempt to cluster operations using different base pointers, as these operations would not be subsequently optimized using the LoadStoreMultOpt pass.
We will look at an example of how this selection process works. Consider the following set of ASM instructions (Listing 2.3):

**Listing 2.3: Example of clustering selection**

```assembly
ldr r3, [r2, #-32]
ldr r12, [r2, #-12]
eor r3, r12, r3
ldr r12, [r2, #-56]
mov r1, sp
ldr r2, [r2, #-64]
eor r3, r3, r12
eor r2, r3, r2
ldr r1, [sp, #324]
ldr r0, [sp, #332]
bic r1, r1, r0
str r1, [r0]
ldr r2, [sp, #328]
ldr r3, [sp, #344]
mov r1, sp
cmp r0, #0
```

Because we consider the effects that non-memory instructions have on the ranges of movement in an earlier step, we have no need to examine the non-memory instructions here, so we will ignore these instructions (Listing 2.4):
Listing 2.4: Example after only focusing on memory instructions

<table>
<thead>
<tr>
<th>Line</th>
<th>Instruction</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>ldr r3, [r2, #-32]</td>
</tr>
<tr>
<td>2</td>
<td>ldr r12, [r2, #-12]</td>
</tr>
<tr>
<td>3</td>
<td>ldr r12, [r2, #-56]</td>
</tr>
<tr>
<td>4</td>
<td>ldr r2, [r2, #-64]</td>
</tr>
<tr>
<td>5</td>
<td>ldr r1, [sp, #324]</td>
</tr>
<tr>
<td>6</td>
<td>ldr r0, [sp, #332]</td>
</tr>
<tr>
<td>7</td>
<td>str r1, [r0]</td>
</tr>
<tr>
<td>8</td>
<td>ldr r2, [sp, #328]</td>
</tr>
<tr>
<td>9</td>
<td>ldr r3, [sp, #344]</td>
</tr>
</tbody>
</table>

Because loads can only combine with other loads (and similarly, stores with other stores), we will only consider clustering one type of memory operation at a time. We first consider loads (Listing 2.5):
We see that we have two base pointers here: \( r2 \) and \( sp \). We arbitrarily select \( r2 \) to examine first (Listing 2.6); our algorithm, detailed in Chapter 3, will simply select the first base register encountered in the basic block to examine first.
Finally, we see only the particular set of memory ops with a common base register that we wish to optimize by moving as close together as possible. For each set of instructions with the same base pointer, we will select as a cluster point the location of the greatest overlap of ranges of motion. At this point in the algorithm, we will have already calculated the ranges of motion, so we do not now need to go back and examine each instruction to see if it is safe to move. Note that not all the instructions in this grouping will necessarily be able to move to the cluster point. We are only selecting the most profitable cluster point, to which the largest number of these instructions may safely move. Any instruction which cannot safely move will be left where it is, and will be analyzed again during a future iteration.

Once we have selected a cluster point, we will then move the memory operations to this point. We will go into more detail about this process, and the selection of the cluster point, in Chapter 3. Note also that the process of partitioning stores to allow for cluster point selection is identical to the process we have just illustrated for loads.
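
The selection heuristic described in this section can be sketched as follows. This is a minimal illustration rather than the algorithm of Chapter 3: the function name, the slot-based representation of ranges, and the example ranges are all hypothetical, and ties between equally good slots are broken in favor of the earliest slot, which is an assumption.

```python
from collections import defaultdict

def best_cluster_points(ops):
    """ops: iterable of (name, base_register, earliest, latest), where
    earliest..latest is the inclusive range of slots the op may move to.
    Returns {base_register: (slot, [names of ops that can reach slot])}."""
    groups = defaultdict(list)
    for name, base, earliest, latest in ops:
        groups[base].append((name, earliest, latest))

    result = {}
    for base, group in groups.items():
        first = min(lo for _, lo, _ in group)
        last = max(hi for _, _, hi in group)
        best_slot, best_members = first, []
        for slot in range(first, last + 1):
            members = [name for name, lo, hi in group if lo <= slot <= hi]
            if len(members) > len(best_members):  # ties keep the earliest slot
                best_slot, best_members = slot, members
        result[base] = (best_slot, best_members)
    return result

# Hypothetical ranges of movement, not taken from the listings above.
loads = [
    ("ld_a", "r2", 0, 1),
    ("ld_b", "r2", 1, 2),
    ("ld_c", "r2", 1, 4),
    ("ld_d", "sp", 4, 6),
    ("ld_e", "sp", 5, 6),
]
points = best_cluster_points(loads)
print(points)
```

Any operation whose range does not cover the selected slot is simply absent from the returned member list, mirroring the rule that such instructions stay in place and are reconsidered in a later iteration.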

2.5 Summary

Loads and stores each have rules to determine an instruction’s range of movement. Once these ranges have been determined, cluster points are selected using the maximum overlap of these ranges as a heuristic, and the instructions are finally moved to these cluster points, allowing them to be combined into multiple memory instructions during the later LoadStoreMultOpt pass.
Chapter 3

Algorithm

We will now present the algorithm to determine the range of movement for memory ops, as well as the method by which these instructions are moved to more advantageous positions.

In order to determine the maximum and minimum bounds for each range of movement for each instruction contained within a basic block, we must use two passes: one to determine the latest instruction a memory operation can be inserted after and one to determine the earliest instruction it can be inserted after. Once these dependencies are identified, we can determine a cluster point by using this information. We will then move the instructions to this cluster point and repeat until we have processed all memory instructions within each basic block.

This chapter will explain the data structures necessary to maintain dependency information for memory operations, and will also explain what operations are necessary during each function operating on each of the basic blocks.
3.1 Level of Optimization

The current LoadStoreMultiOpt optimization performed by LLVM is run when the compiler is passed an optimization flag of -O1 or greater, and operates at the basic block level. As our optimization is a helper pass designed to run before this pass, we will incorporate our algorithm at the basic block level under similar circumstances.

3.2 Determining Range of Movement

Before we start moving memory instructions, we must analyze their dependency information to ensure that we do not compromise program correctness. To do this, we must first examine the range of movement for each memory instruction.

3.2.1 Tracking Memory Instructions

Our first task in determining range of movement for the memory instructions is to create data structures to contain the pertinent dependency data. We define a MemOpRecord class outlined in pseudocode in Listing 3.1.

Listing 3.1: The MemOpRecord Class

```
class MemOpRecord
    MachineInstr *OpLocation                // the memory instruction this record represents
    MachineInstr *LowerBound, *UpperBound   // bounds of its range of movement
```
Each MemOpRecord object holds a pointer to the machine instruction that this record represents. It also holds pointers to the latest (LowerBound) and earliest (UpperBound) machine instructions that the OpLocation instruction can be safely inserted after. A MemOpRecord's OpLocation is set once during the initialization of the optimization class, and is not modified during the remainder of the code. In contrast, the two bounds pointers are initially set to NULL, and are set when the respective upper and lower bounds are discovered during the discovery passes. These bounds will be updated after every collection of memory ops is clustered, to account for any changes in dependency information that occur due to the movement of these instructions.

To keep track of the memory instructions, we create a vector of MemOpRecords and call it AllMemOps. This vector is initially empty, and will be filled during the initialization of the main optimization function. The MemOpRecords contained within this vector will be reused during the lifetime of the function, with each MemOpRecord's range of motion being updated as instructions are moved.

### 3.2.2 Register Dependency Information

Now that we have a way of tracking the memory operations, we need some way of tracking the dependency information, and eventually ending the ranges when the dependencies are violated. To do this, we make a class called RegisterDependencies (see Listing 3.2).
Listing 3.2: The RegisterDependencies Class

```
class RegisterDependencies
{
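  vector<int> use_dep  // AllMemOps indices whose range ends when this register is used
  vector<int> mod_dep  // AllMemOps indices whose range ends when this register is modified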
}
```

A memory instruction may contain two types of register dependencies: a use dependency and a modify dependency. A use dependency is violated when an instruction uses a particular register as part of its execution. An example would be a load instruction, which cannot be delayed past the point where a subsequent instruction requires the use of the value loaded. Similarly, a modify dependency is violated when an instruction modifies a register during its execution. An example of a modify dependency would be a store instruction, which cannot be delayed past the point of the stored register’s first modification. Were it to be thus delayed, it would be entering the beginning of the live range for the next value contained within this register, and would thus be mingling two live ranges within a single register, which is an impossibility.

Accurate tracking of both types of dependencies is necessary to ensure program correctness. We wish to track the memory operations dependent on each individual register, so we create a fixed-size vector of RegisterDependencies, called RegDeps, to hold this dependency information. We will maintain a vector size of 16, one entry for each register used by the Cortex-A8 processor, and will thus be able to track which memory operations stored in AllMemOps are dependent on each particular register. Each RegisterDependencies object will correspond to a particular register, and each use_dep or mod_dep entry will be an integer that indexes into the AllMemOps vector, allowing each register to maintain information about which AllMemOps entries are dependent upon it.

As an example, consider the two assembly instructions in Listing 3.3.

**Listing 3.3: ASM Example 1**

<table>
<thead>
<tr>
<th>Line</th>
<th>Instruction</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>LDR r0, [r2, #8]</td>
</tr>
<tr>
<td>2</td>
<td>ADD r0, r0, r1</td>
</tr>
</tbody>
</table>

As we move through the instructions, we first encounter the LDR instruction. We thus insert a MemOpRecord into the AllMemOps vector, and create 3 entries in the RegDeps vector. Specifically, we insert MemOpRecord pointers into RegDeps[r0].use_dep, RegDeps[r0].mod_dep, and RegDeps[r2].mod_dep. Now, if any following instructions use r0 or modify r0 or r2, we will follow the pointers to the proper MemOpRecord contained in the AllMemOps vector, and end that MemOpRecord’s LowerBound. To see this in action, consider the next instruction. We see that the ADD instruction uses the r0 register, and thus we must set the previous LDR instruction’s LowerBound to point at the last instruction (itself), and then remove all the dependencies that point to the MemOpRecord from the RegDeps vector. In effect, the load in this example cannot move down at all, and thus its LowerBound pointer points to itself, signifying that the lowest instruction that the LDR instruction can be inserted below is itself.
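
The register bookkeeping walked through above can be sketched in Python. This is a simplified illustration, not the thesis implementation: positions within the block stand in for MachineInstr pointers, the tuple layout of an instruction and the function names are hypothetical, only register dependencies are modeled, and a LowerBound is taken to be the position of the instruction just before the one that violates the dependency.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MemOpRecord:
    op_location: int                   # position of the memory op in the block
    lower_bound: Optional[int] = None  # last position it may be inserted after

@dataclass
class RegisterDependencies:
    use_dep: list = field(default_factory=list)  # indices into all_mem_ops
    mod_dep: list = field(default_factory=list)  # indices into all_mem_ops

def end_lower_bounds(deps, all_mem_ops, bound):
    """Close every still-open range referenced by deps, then drop the entries."""
    for idx in deps:
        if all_mem_ops[idx].lower_bound is None:
            all_mem_ops[idx].lower_bound = bound
    deps.clear()

def register_lower_bounds(block):
    """block: list of (kind, reads, writes, value_reg, base_reg) tuples."""
    all_mem_ops = []
    reg_deps = {f"r{n}": RegisterDependencies() for n in range(16)}
    for pos, (kind, reads, writes, value_reg, base_reg) in enumerate(block):
        # First end any ranges whose dependencies this instruction violates.
        for reg in reads:
            end_lower_bounds(reg_deps[reg].use_dep, all_mem_ops, pos - 1)
        for reg in writes:
            end_lower_bounds(reg_deps[reg].mod_dep, all_mem_ops, pos - 1)
        # Then record the dependencies of a newly seen memory op.
        if kind in ("load", "store"):
            idx = len(all_mem_ops)
            all_mem_ops.append(MemOpRecord(pos))
            if kind == "load":
                reg_deps[value_reg].use_dep.append(idx)
            reg_deps[value_reg].mod_dep.append(idx)
            reg_deps[base_reg].mod_dep.append(idx)
    for rec in all_mem_ops:  # ranges never ended run to the end of the block
        if rec.lower_bound is None:
            rec.lower_bound = len(block) - 1
    return all_mem_ops

# The two instructions of Listing 3.3.
listing_3_3 = [
    ("load", {"r2"}, {"r0"}, "r0", "r2"),        # LDR r0, [r2, #8]
    ("alu", {"r0", "r1"}, {"r0"}, None, None),   # ADD r0, r0, r1
]
records = register_lower_bounds(listing_3_3)
print(records[0])
```

For Listing 3.3 the load's lower bound ends up equal to its own position, which corresponds to the LowerBound pointer pointing at the load itself.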
3.2.3 Memory Dependency Information

In addition to register dependencies, a memory instruction may also have dependencies with respect to the memory location it operates on. We choose to handle memory dependencies in a fashion similar to register dependencies; we will create a vector of ints called MemDeps, which will contain indices into the AllMemOps vector of any memory instructions encountered during the findLowerBounds() or findUpperBounds() functions. As these functions move through the basic block, they will check each instruction to see if it is a load or a store. If this current instruction is a load, we will conservatively end the range of movement for any stores we have already encountered which have not already been ended. We do this to avoid the load potentially retrieving a value from the same location in memory as the store uses. Similarly, if the instruction is a store, we will end the range of movement of any previously detected loads or stores which have not yet been ended. This dependency checking is overly conservative, and is employed to avoid the need to track pointer aliasing.

As an example, consider the set of instructions in Listing 3.4.

Listing 3.4: ASM Example 2

```
LDR r0, [r2, #8]
STR r1, [r3, #8]
```

During the findLowerBounds() algorithm, we first encounter the LDR instruction. We add it to the MemDeps vector, and proceed. The next instruction is a STR instruction, which has no register dependencies that affect the LDR instruction. However, we do not know whether the base pointer r3 in this case is an alias of the LDR instruction’s r2 base pointer, in which case this store would overwrite the same memory location that the LDR instruction uses to populate r0. In light of this ambiguity, the algorithm will conservatively end the LDR range of movement here. Though the r2 and r3 registers might have different values, we have no way of being certain, and we choose to consider the pointer aliasing problem in this regard as future work for this algorithm.
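
The conservative memory rule can be sketched on its own, separately from the register checks. The following is an illustration under simplifying assumptions: each instruction is reduced to the string "load", "store", or "other", block positions stand in for instruction pointers, and the function name and return format are hypothetical.

```python
def memory_lower_bounds(kinds):
    """kinds: list of "load", "store", or "other", one per instruction.
    Returns {position of memory op: last position it may be inserted after}."""
    lower = {}
    mem_deps = []  # positions of memory ops whose range is still open
    for pos, kind in enumerate(kinds):
        if kind == "load":
            # A load conservatively ends every pending store.
            ended = [p for p in mem_deps if kinds[p] == "store"]
        elif kind == "store":
            # A store conservatively ends every pending load and store.
            ended = list(mem_deps)
        else:
            ended = []
        for p in ended:
            lower[p] = pos - 1
            mem_deps.remove(p)
        if kind in ("load", "store"):
            mem_deps.append(pos)
    for p in mem_deps:  # ranges never ended run to the end of the block
        lower[p] = len(kinds) - 1
    return lower

# Listing 3.4: a load followed by a store through a possibly aliased pointer.
print(memory_lower_bounds(["load", "store"]))
```

Because no addresses are compared, two loads may still pass one another freely, while any store acts as a barrier for every memory operation seen before it.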

3.2.4 runOptimization()

To run our optimization, the MagnetPass class must be instantiated, and the runOptimization() method must be executed. This is the sole public method of this class, and it is responsible for all initialization, execution, and tear down of our optimization. The pseudocode outline for this method is in Listing 3.5.
Listing 3.5: MagnetPass.runOptimization()

initialize()

while len(AllMemOps) > 0:
    reset all LowerBounds and UpperBounds in AllMemOps to NULL

    run findLowerBounds() to regenerate the LowerBounds
    run findUpperBounds() to regenerate the UpperBounds

    ClusterPoint bestInfo = getBestRangeOverlap(BB)

    if not all instructions in bestInfo are contiguous:
        gatherAtBestRangeOverlap(BB, bestInfo)

cleanUp()

The runOptimization() method will first perform some basic initialization, including creating and initializing the RegDeps vector and populating AllMemOps using the basic block instructions. Next, it will begin a while loop whose exit criterion is AllMemOps becoming empty. As this loop progresses, AllMemOps will grow smaller as memory instructions are processed and moved, until finally all memory ops have been analyzed and the loop terminates.

Inside this loop, every MemOpRecord contained within the AllMemOps vector will have its LowerBound and UpperBound pointers reset to NULL. This allows the subsequent findLowerBounds() and findUpperBounds() methods to set the bounds based on the current ordering of instructions within the basic block. Once all the bounds are determined, an instance of the ClusterPoint class (which will be detailed in a later section) called `bestInfo` is created. This variable contains information regarding which instructions must move and where they must move to. This information is first analyzed to ensure that the instructions to be moved are not already contiguous; if they are, there is no need to reorder them, as they should already be optimized by the later `LoadStoreMultiOpt` optimization without modification. If the instructions are not all contiguous, we will then use the `gatherAtBestRangeOverlap()` method along with the `bestInfo` variable to rearrange the instructions within the basic block.

After all the `AllMemOps` have been processed, we will call a `cleanUp()` method to empty any vectors still containing dependency information.

### 3.2.5 `findLowerBounds()`

The first pass will be used to determine the maximum amount of delay that can be added to each memory operation contained in `AllMemOps`. Because delaying an instruction’s execution appears in a code listing as moving the instruction downward, we use the terms “down” and “lower bound” to refer to delaying an instruction’s execution with respect to other instructions in the code listing. Similarly, we will use the terms “up” and “upper bound” to refer to scheduling an instruction earlier in a code listing with respect to the other instructions.

Reviewing Chapter 2, we see that when moving through the basic block in order, we can move instructions downwards until one of the following cases occurs. For loads:

1. The end of the basic block

2. The first use of the new value to be loaded into the destination register

3. The first write to the base register used by the load

4. The first store to the memory address used by the load, or to an ambiguous address that might be the load’s memory address

And for stores:

1. The end of the basic block

2. The first modification of the value in the source register

3. The first write to the base register used by the store

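4. The first load from the memory address used by the store, or from an ambiguous address that might be the store’s memory address

5. The first store to the memory address used by the store, or to an ambiguous address that might be the store’s memory address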

We must ensure that this first pass properly ends the ranges when one of these conditions is met. The algorithm we will use to calculate the LowerBound values is shown in Listing 3.6.
Because this function will be called many times when compiling a program, the first step is to clear any existing register dependencies in preparation for adding new ones. Once this is accomplished, we iterate through each instruction within the basic block and check to see if the instruction violates any existing dependencies, and thus ends a LowerBound.

The endLowerBoundUsingRegsUsed() function examines the use_dep vector for every register used by the instruction passed to it, and if pointers are found to MemOpRecords, the MemOpRecords are updated with new LowerBound pointers, if the pointers are currently unset. This new pointer is the instruction which was passed into the function, and represents the last instruction that the MemOpRecord’s instruction can be inserted after.

The endLowerBoundUsingRegsModified() function performs the exact same function as the endLowerBoundUsingRegsUsed() function, but it examines the mod_dep vectors rather than the use_dep vectors. The function thus examines the instruction’s modified registers rather than used registers.

We have now checked for explicit register dependencies, but we also need to take care that we do not modify data in memory which could be addressed by a memory instruction. To do this, our algorithm includes an endLowerBoundUsingMem() function, which must check for the case that the current instruction is one of two types of memory operations. We first check to see if the instruction is a load. If it is, we must ensure that the memory location from which we are loading does not have a pending store for that same location. If this case were to occur, we would be loading a value from memory which has not been stored yet. The first part of this function thus takes in a memory location and iterates through the MemDeps vector, examining each entry to ensure that it is not a store that could potentially use the same memory location as the instruction which was passed into the function. If it locates such a store, it sets the store’s LowerBound to point to the instruction. Note that in this implementation, we avoid the pointer aliasing problem by being conservative; rather than keeping track of memory locations pointed to by registers and offsets, we assume that any store following a load could potentially write to that load’s memory address, and thus conservatively end the LowerBound at that subsequent instruction. While this is a real limitation of the above method, pointer aliasing is a difficult problem which is only related in passing to the rest of the algorithm. Future work may be done on this algorithm to include a method for better dealing with pointer aliasing.

After examining loads, we must similarly ensure that if the instruction is a store, no active load instructions exist in the AllMemOps vector which may use the same memory address. If we neglect this, we could potentially overwrite a value needed by a previous load. As above, we prevent this inconsistency from occurring by examining all MemDeps and ending the LowerBound entries of any such loads. There is also an additional consideration with stores, as there is a possibility of a store with one base+offset moving past a second store with a different base+offset that nonetheless happens to point to the same memory location (the two base+offsets are aliases of each other). If this occurs, we could violate a write-after-write (WAW) dependency, so when examining stores we conservatively end the LowerBound entries of any such stores as well.
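The conservative memory rule described above can be sketched as follows. This is a simplified Python illustration, not the thesis implementation: instructions are represented by strings, register dependencies are ignored, and the helper first_pass_memory_only() is a hypothetical driver added for the example. The only pairing that does not end a bound is a load passing another load.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class MemOpRecord:
    name: str
    is_load: bool
    lower_bound: Optional[str] = None  # None: bound not ended yet


def end_lower_bound_using_mem(instr: str, instr_is_load: bool,
                              mem_deps: List[MemOpRecord]) -> None:
    """End the LowerBound of every earlier memory op that may conflict with instr."""
    for rec in mem_deps:
        if rec.lower_bound is not None:
            continue  # already ended by an earlier dependency
        if instr_is_load and rec.is_load:
            continue  # a load may safely move past another load
        # load/store, store/load and store/store pairs are all treated as aliases
        rec.lower_bound = instr


def first_pass_memory_only(block: List[Tuple[str, bool]]) -> Dict[str, Optional[str]]:
    """Hypothetical driver: walk the block forward, tracking memory dependencies only."""
    mem_deps: List[MemOpRecord] = []
    for name, is_load in block:
        end_lower_bound_using_mem(name, is_load, mem_deps)
        mem_deps.append(MemOpRecord(name, is_load))
    return {rec.name: rec.lower_bound for rec in mem_deps}


bounds = first_pass_memory_only([
    ("str_a", False),
    ("ldr_b", True),
    ("ldr_c", True),
    ("str_d", False),
])
```

In this example the first store is stopped by the first load, both loads are stopped by the final store, and the final store keeps an unset bound.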

Thus, at the conclusion of the first pass, all LowerBound pointers have been set for each memory instruction within the basic block. Next, we must use a second pass to evaluate the UpperBound pointers.

3.2.6 findUpperBounds()

The second pass determines the upper bound of movement allowable for each memory operation contained in the basic block. Again, we start by reviewing Chapter 2, and see that when moving through the basic block in reverse order, we can move instructions upwards until one of the following cases occurs. For loads:

1. The beginning of the basic block
2. The last use of the previous value in $R_x$
3. The last write to the base register used by the load
4. The last write to the memory address used by the load

And for stores:

1. The beginning of the basic block
2. The last modification of the value in $R_x$
3. The last write to the base register used by the store
4. The last load from the memory address used by the store

We must ensure that the second pass properly ends the upper bounds when one of these conditions is met. It is interesting to note that these requirements exactly match those found in the first pass if the word *first* is replaced with the word *last*. This suggests that the second pass will be almost identical to the first pass.

The algorithm we will use for the second pass is in Listing 3.7.
This second pass closely matches the first pass; the main difference is that the UpperBound pointers will be modified rather than the LowerBound pointers. We iterate through the basic block in reversed order, so that an instruction with dependencies that exist earlier in the basic block will have its UpperBound pointer set correctly.

Once both passes run, the AllMemOps vector will contain information regarding the range of movement for each memory operation. We must then decide where to move these memory operations.

3.3 Selection of Cluster Points

Once we have all the ranges of movement for the memory operations stored in `AllMemOps`, we are ready to begin reordering the instructions. The instructions are analyzed using the `getBestRangeOverlap()` method, which analyzes the `MemOpRecords` contained in `AllMemOps` and returns an object called `bestInfo`. This object is of a new class called `ClusterPoint`, which is defined in Listing 3.8.

**Listing 3.8: The ClusterPoint Class**

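```cpp
class ClusterPoint {
public:  // pure data structure: members are accessed directly
    MachineInstr* insertAfter;                   // gathered ops are inserted after this instruction
    vector<MachineInstr*> instructionsToGather;  // memory ops to move to the cluster point
};
```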

This class holds information to be passed to the `gatherAtBestRangeOverlap()` function. It contains a vector of `MachineInstr` pointers describing which assembly instructions should be moved, and a pointer to the instruction the `instructionsToGather` instructions should be inserted after. This object is generated by the `getBestRangeOverlap()` function, the algorithm for which is expressed in Listing 3.9.

**Listing 3.9: MagnetPass.getBestRangeOverlap()**

```
getBestRangeOverlap(BB) {
    lookAtLoads = do any loads remain in AllMemOps?
    baseReg = first base reg of memory op of type specified by lookAtLoads

    create empty ClusterPoint objects called currentCluster and bestCluster

    add instrs with NULL UpperBounds and given baseReg into
        currentCluster.instructionsToGather

    bestCluster = currentCluster

    for instr in BB
        currentCluster.insertAfter = instr
        for memop in AllMemOps
            if memop is proper type according to lookAtLoads
                if this memop has the proper base register
                    if this memop’s UpperBound = instr
                        add to currentCluster.instructionsToGather
                    if this memop’s LowerBound = instr
                        remove this memop from currentCluster.instructionsToGather

        if len(currentCluster.instructionsToGather) > len(bestCluster.instructionsToGather)
            bestCluster = currentCluster

    remove all bestCluster.instructionsToGather from AllMemOps
    remove bestCluster.insertAfter from bestCluster.instructionsToGather, if relevant

    return bestCluster
}
```

The algorithm requires some explanation. Initially, we must determine if we will move the loads or the stores, as it makes no sense to cluster loads with stores because the ARMLoadStoreOpt pass will not combine them. We arbitrarily choose to move all loads first, and then any stores that are present in the basic block. We perform this by setting a boolean variable (lookAtLoads) to true if the type of memory operation we wish to optimize is a load and to false if it is a store. We must then determine which base register we will use to select instructions for clustering. We use a standard greedy approach and select the base register of the first memory operation of the type specified in lookAtLoads to be the baseReg for the rest of this method call.

We then create two instances of the ClusterPoint class called currentCluster and bestCluster. We must have some way of keeping track of the current cluster point (currentCluster.insertAfter) along with the instructions that can move there (currentCluster.instructionsToGather). We also wish to track the current best cluster point, which we define as the cluster point around which we could gather the most contiguous instructions. We have the bestCluster object to hold this information. The bestCluster will be constantly compared to the currentCluster object, and will always be set to the best cluster point we have yet found in the basic block.

We must initially add any MemOpRecords contained in AllMemOps that have NULL UpperBounds to currentCluster.instructionsToGather, assuming they have the base register given by baseReg. These are instructions whose range of movement allows them to move to the very top of the basic block.

Next, we iterate through the basic block, and initially set the `currentCluster.insertAfter` pointer to the current instruction. We must then determine which instructions currently in `currentCluster.instructionsToGather` have a `LowerBound` that points to the current instruction. These instructions must be removed from `currentCluster.instructionsToGather`, as the current cluster point is outside of the range of movement for these instructions. Conversely, any instructions whose `UpperBound` points to the current instruction should be added to the `currentCluster.instructionsToGather` vector, as the current cluster point has just entered their range of movement.

Only instructions of the type specified in `lookAtLoads` will be added to `currentCluster.instructionsToGather`, and this is reflected in the first if statement. Additionally, only `MemOpRecords` which have the base register specified by `baseReg` should be considered, which is reflected in the second if statement. Once a `MemOpRecord` has passed these two tests, it will be added to or removed from `currentCluster.instructionsToGather` according to the criteria above.

Finally, after every iteration the `bestCluster` object is compared with the `currentCluster` object, and whichever object holds a larger number of `instructionsToGather` entries becomes the new `bestCluster` object. Thus, the `bestCluster` object always contains the most profitable cluster point as well as every instruction that can be moved to that cluster point.
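The sweep described above can be sketched in Python as follows. This is a minimal sketch under several assumptions: instructions are identified by name strings rather than pointers, the MemOpRecord fields shown here are a simplified stand-in for the thesis data structure, and the best cluster is stored as a copy of the current one so that later additions and removals do not alter it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MemOpRecord:
    name: str
    is_load: bool
    base_reg: str
    upper_bound: Optional[str] = None  # None: may move to the top of the block
    lower_bound: Optional[str] = None  # None: may move to the bottom of the block


@dataclass
class ClusterPoint:
    insert_after: Optional[str] = None
    instructions_to_gather: List[str] = field(default_factory=list)


def get_best_range_overlap(block: List[str], all_mem_ops: List[MemOpRecord]) -> ClusterPoint:
    look_at_loads = any(op.is_load for op in all_mem_ops)  # loads first, then stores
    typed = [op for op in all_mem_ops if op.is_load == look_at_loads]
    if not typed:
        return ClusterPoint()
    base_reg = typed[0].base_reg  # greedy choice of base register
    candidates = [op for op in typed if op.base_reg == base_reg]

    # Ops with no upper bound are already in range at the top of the block.
    current = [op.name for op in candidates if op.upper_bound is None]
    best = ClusterPoint(None, list(current))

    for instr in block:
        for op in candidates:
            if op.upper_bound == instr and op.name not in current:
                current.append(op.name)   # cluster point entered this op's range
            if op.lower_bound == instr and op.name in current:
                current.remove(op.name)   # cluster point left this op's range
        if len(current) > len(best.instructions_to_gather):
            best = ClusterPoint(instr, list(current))

    gathered = set(best.instructions_to_gather)
    all_mem_ops[:] = [op for op in all_mem_ops if op.name not in gathered]
    if best.insert_after in best.instructions_to_gather:
        best.instructions_to_gather.remove(best.insert_after)
    return best


# Hypothetical block with three sp-relative loads and one store.
block = ["ldr1", "add", "ldr2", "mul", "ldr3", "str1"]
ops = [
    MemOpRecord("ldr1", True, "sp", upper_bound=None, lower_bound="mul"),
    MemOpRecord("ldr2", True, "sp", upper_bound="add"),
    MemOpRecord("ldr3", True, "sp", upper_bound="add"),
    MemOpRecord("str1", False, "r0"),
]
best = get_best_range_overlap(block, ops)
```

Here all three loads overlap immediately after the add instruction, so that instruction becomes the cluster point and only the store remains in the list for a later call.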

After iterating through the basic block, `bestCluster` will contain all the instructions that should be moved. Because we will move them in the next step, we remove them from `AllMemOps` to prevent them from being considered again. In addition, if `bestCluster.instructionsToGather` contains `bestCluster.insertAfter`, we will remove this instruction from `bestCluster.instructionsToGather` to prevent the cluster point from being moved in the later `gatherAtBestRangeOverlap()` method. Finally, we return the `bestCluster` object to the `runOptimization()` public method.

Once the cluster location and the MemOpRecords of the most profitable instructions to be moved are determined, this information is passed to the gatherAtBestRangeOverlap() function.

### 3.4 Instruction Reordering

We now know which instructions to move and where to move them to. We pass that information to the gatherAtBestRangeOverlap() method, seen in Listing 3.10.

**Listing 3.10: MagnetPass.gatherAtBestRangeOverlap()**

```
gatherAtBestRangeOverlap(BB, bestCluster)

    split bestCluster.instructionsToGather into moveUp and moveDown vectors

    for instr in BB from first moveDown to bestCluster.insertAfter
        remove any kill flags on instr regs
        add these removed kill flags to last
            bestCluster.instructionsToGather using the regs

    for instr in reversed(BB) from first moveUp to bestCluster.insertAfter
        give any moveUp kill flags to last instr using that reg

    insert all bestCluster.instructionsToGather after
        bestCluster.insertAfter in BB
```

This function also requires some explanation. The
bestCluster.instructionsToGather memory operations are divided into
two vectors: one for instructions that exist before bestCluster.insertAfter
(stored in the moveDown vector) and another for instructions that exist after (stored
in the moveUp vector). This is done because different procedures are required to
prevent disrupting kill flags (defined below) for each group of instructions.

LLVM appends metadata to the instructions in a basic block, including when
the final use of a register value occurs and the register is available for scavenging
(able to be used for a new variable). This particular data is known internally as the
kill flag. We must prevent a memory instruction from moving past an instruction
that kills one of the memory instruction’s registers. If we do not avoid this scenario, LLVM may scavenge the register before the last use of its variable, effectively destroying program correctness.

To combat this, all the memory instructions in bestCluster.instructionsToGather that must move down scan each instruction they will move past and remove every kill flag that refers to registers that they themselves use. Once this is completed, any registers that used to have kill flags will be subsumed by the memory instructions using those registers, and the kill flags will thus migrate down to the lowest instruction using the value’s register.

On the other hand, any memory instruction that must move up may move past the use of one of its registers, and if the memory instruction itself contains registers with kill flags, correctness will again be compromised. To prevent this, any memory instruction that must move up scans the instructions it moves past in reverse order and hands the responsibility of killing the register value to the first instruction it encounters that uses the register (that is, the last such use in program order). Thus program correctness is preserved in both cases.

Once the kill flags have been migrated, the memory instructions are removed from the basic block and reinserted immediately after the bestCluster.insertAfter instruction, and the method ends. The runOptimization() method then loops again until no MemOpRecords remain in AllMemOps, at which point all memory instructions have been analyzed and moved to locations where they are more likely to be merged.
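The kill flag migration can be illustrated with a small Python model. This is a sketch only: the Instr class, its uses and kills sets, and the move_down() and move_up() helpers are hypothetical simplifications that move one memory instruction at a time, and they are not LLVM APIs.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Instr:
    name: str
    uses: Set[str]
    kills: Set[str] = field(default_factory=set)  # registers whose last use is here


def move_down(block: List[Instr], src: int, dst: int) -> None:
    """Move block[src] so that it sits directly after block[dst], where dst > src."""
    mem = block[src]
    for passed in block[src + 1:dst + 1]:
        taken = passed.kills & mem.uses  # kill flags on registers the memory op reads
        passed.kills -= taken
        mem.kills |= taken               # the memory op is now the last user
    block.insert(dst, block.pop(src))


def move_up(block: List[Instr], src: int, dst: int) -> None:
    """Move block[src] so that it sits directly after block[dst], where dst < src."""
    mem = block[src]
    for passed in reversed(block[dst + 1:src]):
        give = mem.kills & passed.uses   # this instr becomes the last user of these regs
        passed.kills |= give
        mem.kills -= give
    block.insert(dst + 1, block.pop(src))


# Moving a load down past an add that used to kill sp.
down = [Instr("ldr", {"sp"}), Instr("add", {"sp", "r1"}, {"sp"}), Instr("mul", {"r1"})]
move_down(down, 0, 1)

# Moving a store up past two users of r0, which the store used to kill.
up = [Instr("mov", {"r2"}), Instr("sub", {"r0"}), Instr("cmp", {"r0"}),
      Instr("str", {"r0", "r3"}, {"r0"})]
move_up(up, 3, 0)
```

In the first case the kill flag for sp migrates from the add to the load. In the second case the flag for r0 moves from the store to the cmp, which is the last remaining user in program order.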

Chapter 4

Implementation

Now that we have an understanding of the algorithm and data structures we will use to perform this optimization, we must look at how to implement these items in the LLVM compiler. We will look at how the existing post-register allocation optimizations are structured in LLVM, and illustrate where we will insert this new optimization. We will also examine some changes that became necessary to properly implement the algorithm in the existing codebase.

4.1 Organization of LLVM Code

The LLVM project is available on the Internet as a Subversion repository, allowing anyone to check out the source code and development history of the project. Once the repository has been checked out, it can be built using the Make tool and tested with a test suite similarly available from the LLVM website. The code, which is written in the C++ language (along with some C modules and domain specific language components), may be modified and rebuilt. When testing our optimization, we built the standard version 2.8svn, moved the resulting binaries and libraries, then added our optimization and rebuilt the project. This gave us both an unaltered version and a modified version of the binary, making it easy to compare the generated testcases.

The source tree contains many files and directories, only a few of which are relevant to this project. A very abbreviated diagram of the directory structure is shown in Listing 4.1.

**Listing 4.1: Relevant portion of the LLVM source tree**

```
USER/src/llvm
 | -autoconf
 | -bindings
 | -cmake
 | -Debug+Asserts
 |---bin
 |---lib
 | -docs
 | -examples
 | -include
 |---llvm
 |-----ADT
 |-----Analysis
 |-----Assembly
 |-----Bitcode
 |-----CodeGen
```

There are a few important directories to note here. First, we have the `Debug+Asserts` directory, which contains the binaries and libraries built by `Make` which are actually executed when running the compiler. We also have the `include` directory, which contains the header files imported for various classes and function collections. Finally, we have the lib directory, which contains the implementations for many of these header files. The lib directory contains many files that are used during any compilation process, but it also contains the Target directory, which holds any optimizations and definitions that are specific to a single architecture. In particular, we have shown the subdirectories of the ARM directory. It is in this directory that the ARMLoadStoreOptimizer.cpp file resides, and this is the file containing the existing two-phase memory operations optimization. We will add our code into this file as well, as explained below.

### 4.2 Current Post-Register Allocation Memory Optimization

The file ARMLoadStoreOptimizer.cpp contains the class and method definitions for the MachineFunctionPasses used to optimize memory instructions. An overview of this file is in Listing 4.2.

**Listing 4.2: ARMLoadStoreOptimizer.cpp class hierarchy**

```cpp
Function createARMLoadStoreOptimizationPass
  // Creates and returns instances of below two classes

Class ARMPreAllocLoadStoreOpt
  // To be replaced by our new optimization

Class ARMLoadStoreOpt
  runOnMachineFunction
    LoadStoreMultipleOpti
      MergeLDR_STR
      MergeOpsUpdate
      MergeOps
    MergeBaseUpdateLSMultiple
    MergeBaseUpdateLoadStore
    MergeReturnIntoLDM
```

There are two main classes in this source file: the `ARMPreAllocLoadStoreOpt` class is used prior to register allocation to rearrange memory ops, while the `ARMLoadStoreOpt` class is used to actually perform the memory op merges. We will be attempting to replace the `ARMPreAllocLoadStoreOpt` functionality with our own post-register allocation MagnetPass version. We will continue to use the `ARMLoadStoreOpt` class to actually combine contiguous memory operations. Also included in this file is the `createARMLoadStoreOptimizationPass` function, which is part of the global `llvm` namespace. This function simply creates an object of one of these two classes (depending on whether an argument is passed into it) and returns to its caller a `FunctionPass` pointer to this new object. The caller is then free to use the object’s methods to perform optimizations.

When the optimization is performed, LLVM passes a MachineFunction reference (i.e., a function of machine-dependent assembly instructions) to the runOnMachineFunction() method within the ARMLoadStoreOpt class. This method breaks the MachineFunction into basic blocks (called MachineBasicBlocks), initializes some member variables for the subsequent methods based on the TargetMachine object derived from the MachineFunction object, and passes the basic blocks to the LoadStoreMultipleOpti() and the MergeReturnIntoLDM() methods, which we will examine below. Like most of the methods within this class, runOnMachineFunction() operates on object references, and will simply return a boolean indicating whether or not the method was successful in performing its task.

The MergeReturnIntoLDM() method is a simple function that checks a basic block for a closing unconditional branch to the link register (LR). If such a branch exists, and the instruction immediately preceding the branch is a multiple load instruction, the method will roll up the branch into the multiple load instruction by loading the LR register contents into the program counter (PC), effectively causing a branch to occur without using an explicit branch instruction. This is one of several simple optimizations that LLVM performs to save code size within this source file.
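The idea behind this return merge can be shown with a toy Python model. The tuple encoding of instructions and the requirement that LR appears in the load's register list are assumptions made for this illustration; the sketch does not reproduce the checks LLVM actually performs.

```python
from typing import List, Tuple

# Hypothetical encodings: ("LDM", base, [registers]) and ("BX", "lr").
Instr = Tuple


def merge_return_into_ldm(block: List[Instr]) -> bool:
    """Fold a trailing 'BX lr' into a preceding LDM that reloads lr."""
    if len(block) < 2:
        return False
    ldm, ret = block[-2], block[-1]
    if ret != ("BX", "lr"):
        return False
    if ldm[0] != "LDM" or "lr" not in ldm[2]:
        return False
    # Loading the saved return address straight into pc performs the return.
    regs = ["pc" if reg == "lr" else reg for reg in ldm[2]]
    block[-2:] = [("LDM", ldm[1], regs)]
    return True


epilogue = [("ADD", "r0", "r0", 1), ("LDM", "sp!", ["r4", "lr"]), ("BX", "lr")]
changed = merge_return_into_ldm(epilogue)
```

The boolean return value mirrors the convention described above, where a method reports whether it performed its task.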

The LoadStoreMultipleOpti() method is the main hub of activity for this optimization. It iterates through a basic block and calls several helper methods to merge all series of contiguous memory instructions which do not overwrite their base registers and which use varying, but not necessarily ordered, base register offsets. Each series of such instructions is merged into a single multiple memory instruction using the `MergeLDR_STR()` method, which in turn calls `MergeOpsUpdate()`, which tracks the merges and finally performs the merge using the `MergeOps()` method. Along the way, the `MergeBaseUpdateLoadStore()` method will combine a trailing increment of a base register with the preceding memory instruction, and the `MergeBaseUpdateLSMultiple()` method will do the same for a multiple memory instruction. These smaller optimizations complement the main optimization of compacting multiple memory instructions into a single multiple memory instruction.
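A toy Python check can illustrate what makes a series of loads mergeable. This is not the LLVM logic; it is a simplified sketch that assumes word-sized loads, represents each load as a hypothetical (destination register number, base register, offset) tuple, and relies on the ARM rule that a load-multiple places the lowest-numbered register at the lowest address.

```python
from typing import List, Optional, Tuple

Load = Tuple[int, str, int]  # (destination register number, base register, byte offset)


def merge_loads(run: List[Load]) -> Optional[Tuple[str, str, int, List[int]]]:
    """Return an LDM-like tuple for a run of adjacent loads, or None if not mergeable."""
    if len(run) < 2:
        return None
    bases = {base for _, base, _ in run}
    if len(bases) != 1:
        return None  # all loads must share one base register
    ordered = sorted(run, key=lambda op: op[2])  # offsets need not arrive in order
    offsets = [off for _, _, off in ordered]
    regs = [reg for reg, _, _ in ordered]
    contiguous = all(b - a == 4 for a, b in zip(offsets, offsets[1:]))
    ascending = all(a < b for a, b in zip(regs, regs[1:]))
    if not (contiguous and ascending):
        return None
    return ("LDM", bases.pop(), offsets[0], regs)


merged = merge_loads([(2, "sp", 8), (1, "sp", 4)])
gap = merge_loads([(1, "sp", 4), (2, "sp", 12)])
```

The first call succeeds even though the offsets appear out of order, while the second fails because the two addresses are not contiguous.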

### 4.3 New Optimization Pass Class

The full source code is located in Appendix C. The classes explained here are in an anonymous namespace in the `ARMLoadStoreOptimizer.cpp` file. As mentioned in Chapter 3, we require classes for certain data structures, namely `MemOpRecord`, `RegisterDependencies`, and `ClusterPoint`. Each of these classes is a pure data structure; that is, they do not have any methods associated with them, including constructors. We also define a class `BBPrinter` with methods that make it easier to debug problems with the optimization by printing out the instructions within the basic blocks. This class and its associated methods do not modify the basic blocks in any way, and the code is included in the Appendix for completeness, but will not be further examined here. Also in this namespace we include a number of simple functions to work with registers and make boolean conditional statements in the code more readable. Some of these simple functions have been copied from other locations in the LLVM source code, as they give insight into ARM specific instruction nuances.

The most important class defined as part of our optimization is the MagnetPass class, which contains many methods to implement the algorithm explained earlier in C++ code. Apart from a constructor and destructor that exist to initialize and clean up the logging framework, the class has only a single public method called runOptimization. This method is the main entry point to the optimization, and when called it iterates through the algorithm described earlier in Chapter 3. This runOptimization method is called once for each basic block, and operates only on the given basic block, which is passed by reference to the method, allowing it to operate directly on the machine instructions within the basic block. Each function described in Chapter 3 has a corresponding private method in the MagnetPass class, which modifies the class object’s internal state stored in private object properties. This state primarily includes the AllMemOps, RegDeps, and MemDeps arrays. As we are implementing this algorithm in C++, array datatypes are stored in C++ standard library vectors. The process of looping through these vectors is accomplished using the C++ idiom of object iterators. For completeness, the source listing also contains our logging framework and logging messages, which also serve as documentation in addition to the code comments.

4.4 Using the New Class During Compilation

The ARMLoadStoreOpt class is the post-register allocation pass which combines sequential memory operations if they have the same base address. It has a public method runOnMachineFunction which takes in a function containing machine instructions and performs the instruction compaction on each basic block contained within. This ARMLoadStoreOpt class is instantiated by the createARMLoadStoreOptimizationPass function, which returns the newly created object to be used to run on machine instruction functions by a pass manager contained in another source file.

The MagnetPass class is instantiated in the constructor of the ARMLoadStoreOpt class. We then simply run its runOptimization method on each basic block just prior to running the instruction merging process normally executed by the runOnMachineFunction method. Thus, our optimization rearranges the order of the memory operations into advantageous locations, and the subsequent merging occurs more efficiently. Because this pass depends on the ARMLoadStoreOpt class to merge instructions, we chose to couple these two passes relatively closely rather than add additional scaffolding to make our MagnetPass class more independent.

4.5 Prototype

During the development of the algorithm, we created a prototype of the new optimization pass in Python. We developed the basic harness necessary to test the new pass; as an example, we created a `MachineInstr` class having the attributes and methods we expected to have in LLVM. The algorithm in Chapter 3 was developed as we created this prototype, and the source code for the prototype is included in Appendix C.

The Python prototype was written as a proof of concept, and contains many Python idioms that do not directly translate into C++. Because the algorithm was designed to work with C++ concepts, such as pointers, that are less prominent in Python, the prototype contains several features included to make the C++ design work when implemented in Python. One example is the “handles” attached to several of the classes, which are used in place of more traditional C++ pointers. Using these handles, we can assign each machine instruction a unique identifier that would usually come for free in C++ if we used the address of the memory object.
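One way such handles could look is sketched below. This is an illustrative guess rather than the prototype's actual code: the class layout, the handle attribute name, and the use of itertools.count() to generate identifiers are all assumptions.

```python
import itertools


class MachineInstr:
    """Stand-in for an LLVM machine instruction, identified by a unique handle."""

    _handles = itertools.count()  # shared counter, plays the role of an address

    def __init__(self, opcode, operands):
        self.handle = next(MachineInstr._handles)
        self.opcode = opcode
        self.operands = tuple(operands)

    def __eq__(self, other):
        # Identity is decided by the handle, not by opcode and operands.
        return isinstance(other, MachineInstr) and self.handle == other.handle

    def __hash__(self):
        return hash(self.handle)


first = MachineInstr("LDR", ["r1", "sp", 4])
second = MachineInstr("LDR", ["r1", "sp", 4])

# Bounds can then refer to instructions by handle, as a pointer would in C++.
upper_bounds = {first.handle: None, second.handle: first.handle}
```

Two textually identical instructions remain distinct because their handles differ, which is the property a pointer comparison provides in C++.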

In addition, when we began implementing the pass in C++, we discovered that some of the methods we had assumed would be available did not exist in LLVM, and we had to make several changes in our final C++ implementation of our algorithm. These changes are explained below. We include the Python prototype because its code is easier to understand than our final C++ code, and it serves as a starting point for understanding our algorithm.

4.6 Changes to Our Algorithm

4.6.1 Mapping Registers

While a program running on a Cortex-A8 processor only has access to 16 registers (r0 - r15), there are in fact 112 registers the compiler must keep track of. The RegDeps vector of Dependencies only needs to keep track of these 16 registers, and because the registers are internally stored in LLVM as enums, we need a way of mapping these register enums to integers we can use to index the RegDeps vector. LLVM uses a function called getRegisterNumbering() for exactly this purpose. Unfortunately, it can only accept certain numbered registers as valid inputs, and if a passed register does not have a corresponding number, the function will cause LLVM to crash. An example of such a register which has no corresponding number would be the Current Program Status Register (CPSR), which is used to control conditional instruction execution (such as IT blocks). Because we have no control over what registers might be stored in the operands of the machine instructions, we chose to create a similar function called getRegNum() which performs precisely the same mapping, but rather than crashing when passed a bad register our function simply returns -1. The return value is tested by the function caller, and registers that are not in the range r0 - r15 are ignored. Non-negative return values are used to index into the RegDeps vector.
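The behaviour of getRegNum() can be sketched in Python as follows. The string register names stand in for LLVM's register enums, and the record_use() helper and list-of-lists layout for RegDeps are hypothetical; only the convention of returning -1 for unmapped registers comes from the description above.

```python
# Map the 16 core registers to indices 0..15; names are stand-ins for LLVM enums.
ARM_CORE_REGS = {f"r{i}": i for i in range(16)}


def get_reg_num(reg: str) -> int:
    """Return the index for r0..r15, or -1 for any other register (e.g. cpsr)."""
    return ARM_CORE_REGS.get(reg, -1)


def record_use(reg_deps, reg: str, instr: str) -> bool:
    """Hypothetical caller: ignore registers that have no slot in reg_deps."""
    idx = get_reg_num(reg)
    if idx < 0:
        return False          # not in r0..r15, so it is skipped
    reg_deps[idx].append(instr)
    return True


reg_deps = [[] for _ in range(16)]
tracked = record_use(reg_deps, "r0", "ADD r0, r6, #18")
skipped = record_use(reg_deps, "cpsr", "CMP r0, #0")
```

Note that r0 maps to index 0, so the caller tests for a negative value rather than for a positive one.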

4.6.2 Deleting RegDeps entries

Our prototype removed all entries stored in the RegDeps vectors once their associated MemOpRecords had their bounds set, so that the vectors held only the dependencies that had not yet been resolved. This reduced the time needed to update dependencies (the only values in the RegDeps vectors were values that needed updating), but the process of searching for and removing dependencies pointing to MemOpRecords which had already had their bounds updated was time consuming. For our C++ implementation, we substituted a simple if statement during the dependency updating process which checks, before overwriting a bound, that the current bound is NULL. If it is not, the bound has already been set, and the MemOpRecord is skipped. This if statement eliminated the need to delete RegDeps entries once their MemOpRecord ranges had been updated. Because the basic blocks on average are short, the time required to skip entries in the RegDeps vectors was deemed to be acceptable, and the vector entries were not removed once their MemOpRecords were updated.

4.6.3 Dependencies at Basic Block Ends

The algorithm implemented in the prototype ran two passes to determine ranges of movement. The first pass would run endLowerBoundUsingRegs() functions as it iterated through the basic block to set the bounds properly when iterating past dependencies. Once all instructions were iterated through, the endLowerBoundUsingRegs() functions were again run to set the remaining dependencies to NULL values. In our C++ implementation, we initialized all dependencies with NULL values, and any dependencies that were not overwritten in the iteration through the basic block would remain NULL. Because these are the same instructions that can be moved to the ends of the basic blocks, we would want them to have NULL bounds anyway. In this way, by initializing them to NULL, we can avoid this final step of running the `endLowerBoundUsingRegs()` functions after the end of the loop. Similarly, the second pass no longer needs the trailing `endUpperBoundUsingRegs()` functions after executing its reversed order iteration through the basic block.

### 4.6.4 Algorithm Rewrite

In the early stages of this project, we employed a simple two pass algorithm to generate dependencies, then moved all memory operations within a basic block iteratively without regenerating the dependency information. This was done to minimize the number of passes through the basic block required to perform the optimization. Unfortunately, several months after the initial design we discovered a flaw in this algorithm. We will illustrate using Listing 4.3.

**Listing 4.3: Block Before Moves**

```
ADD r0, r6, #18
LDR r1, sp, #4   <-+ UpperBound
STR r1, r0, #4   ---|
LDR r0, sp, #12  <-+ LowerBound
```

This block is taken from the `qsort` benchmark. We see that the two pass range of movement generation has been performed and that the `STR` instruction cannot move up beyond the `LDR` instruction pointed to by `UpperBound` (there is a write dependency on the r1 register). This LDR instruction itself has a range of movement (not shown) with an upper bound of the top of this code block. If this LDR instruction is moved by our optimization, the result can be seen in Listing 4.4.

**Listing 4.4: Block After Moves**

<pre>
LDR r1, sp, #4   &lt;-+ UpperBound
ADD r0, r6, #18
STR r1, r0, #4   ---|
LDR r0, sp, #12  &lt;-+ LowerBound
</pre>

As shown, the LDR instruction has moved to the top of this block of code. This is perfectly legal; however, note that because the UpperBound is a pointer to the LDR instruction, the UpperBound moves with the LDR instruction. As far as the optimization is concerned, the STR can move up above the intermediate ADD instruction. In reality, if the STR instruction were to move above the ADD instruction, it would violate a dependency on r0 (which is defined by the ADD instruction), and code correctness would be lost. Clearly, whenever a series of instructions is moved, any previously computed range of movement could become outdated. Rather than attempt to track these dependencies on the fly (i.e., attempt to repair them during a move), we elected to rewrite the algorithm so that after every clustering of instructions all register dependencies are cleared and regenerated. Unfortunately this leads to a significant increase in passes through the basic blocks, but this is necessary to prevent the loss of program correctness.
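The stale bound problem can be reproduced with a few lines of Python. This is a simplified sketch: the Instr class and the upper_bound() helper are hypothetical, and the helper considers only register definitions, ignoring the memory and base register rules of the full algorithm.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Instr:
    text: str
    defs: FrozenSet[str]
    uses: FrozenSet[str]


def upper_bound(block: List[Instr], instr: Instr) -> Optional[Instr]:
    """Nearest earlier instruction that defines a register which instr reads."""
    idx = block.index(instr)
    for earlier in reversed(block[:idx]):
        if earlier.defs & instr.uses:
            return earlier
    return None


add = Instr("ADD r0, r6, #18", frozenset({"r0"}), frozenset({"r6"}))
ldr1 = Instr("LDR r1, sp, #4", frozenset({"r1"}), frozenset({"sp"}))
store = Instr("STR r1, r0, #4", frozenset(), frozenset({"r0", "r1"}))
ldr2 = Instr("LDR r0, sp, #12", frozenset({"r0"}), frozenset({"sp"}))

block = [add, ldr1, store, ldr2]
stale = upper_bound(block, store)   # computed before any instruction moves

# Move the first LDR to the top of the block, as in Listing 4.4.
block.remove(ldr1)
block.insert(0, ldr1)

fresh = upper_bound(block, store)   # regenerated after the move
```

The stale bound still refers to the LDR, which now sits above the ADD, whereas the regenerated bound correctly stops the STR at the ADD.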

The corrected algorithm was presented in Chapter 3, and differs from our original design. The code listed for the Python prototype, however, uses the early version of our design and will not correctly handle this special case. Because the prototype was designed to aid our refinement of the algorithm rather than for actual use, we have elected not to update it to match our final algorithm.

4.7 Additional Considerations

Because the pre-register allocation pass we are comparing against is designed to integrate with the `ARMLoadStoreOpt` pass, there is no way to disable the pre-register allocation pass using LLVM command line options without disabling the merging pass as well. We must therefore comment out the pre-register allocation pass in the code to disable it, and thus fairly compare our new optimization against the old one. Because we maintain separately built binaries for versions of LLVM with and without this optimization, comparisons between optimizations were simple.

Chapter 5

Results

Having implemented our MagnetPass post-register allocation algorithm in code, we must now examine how it compares to the existing optimization in LLVM. To do this, we will compile a set of source files and compare the outputs generated by different levels of optimizations.

5.1 Benchmarks Used

When testing compiler performance, it is important to select a suite of source files to serve as a representative sampling of the sorts of programs the compiler would be expected to build. These programs are used to verify that the compiler maintains program correctness and give the compiler (or, in this case, compiler optimization) raw material to use to display the efficacy of a particular compilation technique. There are many benchmark suites in existence that are used to demonstrate compiler or system performance; an example would be the popular SPEC CPU benchmark suites, which are used to measure performance on CPU-intensive workloads.

Because our optimization focuses on ARM processors, we selected benchmarks that target embedded system workloads, namely the freely available MiBench suite [22]. MiBench is a set of 35 applications that are representative of typical programs executed in an embedded context. These applications are split into a number of different domains: Automotive/Industrial Control, Consumer Devices, Office Automation, Network, Security, and Telecommunications. We gather data from a sampling of these 35 benchmarks and use it to compare optimizations. We selected the following four benchmarks:

5.1.1 dijkstra

The dijkstra benchmark is part of the Network domain, and constructs a large graph using an adjacency matrix representation. It then calculates the shortest path between every pair of nodes by using Dijkstra’s algorithm in $O(n^2)$ time. The source file for this benchmark is 174 lines of C code long, and it generates an ARM ASM file that is 628 lines in length.

5.1.2 qsort

Part of the Automotive/Industrial Control domain, the qsort benchmark implements the well-known quick sort algorithm to sort large arrays of strings into ascending order. This is a very common requirement for many different applications. This program has a source file that is only 55 lines of C code, and generates 289 lines of ARM ASM when compiled.

5.1.3 sha

The SHA (Secure Hash Algorithm) benchmark is part of the Security domain, and is used to generate message digests 160 bits in size for an arbitrary input. The SHA family of hash algorithms is commonly used to generate digital signatures and store password information without revealing a plaintext password. The MiBench suite uses the original SHA algorithm, now known as SHA-0, which was later revised into SHA-1. Later generations of this algorithm (SHA-2, etc) are still in common use today and are published by the National Institute of Standards and Technology (NIST). The source file for this program is 210 lines of C code, which is compiled to 600 lines of ARM ASM. There is also a driver file which applies the dataset to the algorithm in the main file; it is 31 lines of C code and 107 lines of ARM ASM in length.

5.1.4 stringsearch

The Office domain contains the stringsearch benchmark, which simply moves through text looking for a particular string without regard to case. This program is composed of 4 source files containing over 3000 lines of code, though most of those are the input dataset within the driver file. Nearly 5000 lines of ARM ASM are generated by these files, though again most of this is constant data.

Each of these benchmarks comes with its own dataset and its own set of expected output. For most of these benchmarks, the data is stored separately and is not reflected in the numbers given above (the notable exception being stringsearch). For more details about the MiBench suite, see Appendix A.

5.2 Method of Testing

To generate our comparisons, we modified each benchmark’s Makefile to use LLVM to compile the program. We generated ARM ASM using a two step compilation procedure. First, we used llvm-gcc to convert the high level language files into LLVM bytecode with the -emit-llvm and -c flags. This bytecode file was then used as input for the llc program in the LLVM library to convert the LLVM bytecode to human readable assembly language. The bytecode was also compiled down to an executable and tested on a BeagleBoard ARM development board running an ARM Cortex-A8 CPU, which implements the ARMv7 ISA. The output was compared against the benchmark references to ensure that program correctness was maintained.
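As an illustration, the two-step procedure can be written as a pair of command lines. The sketch below only builds the commands and does not run them. The file names are hypothetical, passing `-O3` to both tools is an assumption, and the exact flags used in the modified Makefiles may have differed (for example, a target selection option may be needed when cross-compiling).

```python
from pathlib import Path


def build_commands(source, opt_level="-O3"):
    """Return the (front end, llc) command lines for one C source file."""
    src = Path(source)
    bitcode = src.with_suffix(".bc")
    asm = src.with_suffix(".s")
    # Step 1: llvm-gcc emits LLVM bytecode instead of native object code.
    front_end = ["llvm-gcc", opt_level, "-emit-llvm", "-c", str(src), "-o", str(bitcode)]
    # Step 2: llc lowers the bytecode to human-readable assembly.
    back_end = ["llc", opt_level, str(bitcode), "-o", str(asm)]
    return front_end, back_end


front_end, back_end = build_commands("dijkstra_large.c")
print(" ".join(front_end))
print(" ".join(back_end))
```

Keeping the two steps separate mirrors the procedure described above: the same bytecode file can be fed to differently built copies of `llc` to compare the optimizations.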

We compiled the benchmarks using a variety of scenarios to illustrate the effectiveness of our post-register allocation optimization, and to compare it against the existing pre-register allocation version, a version with no pre- or post-register allocation optimizations, and a version with no optimizations at all (-O0). This allows us to see how effective the optimization is compared to several different parallel data points.

The version with no optimization is used as a baseline, and due to the large number of other optimizations it disables, we expect that it will always compare unfavorably with any -O3 compiler build. It is denoted in the Tables below as “O0 none”.

The version without either the pre- or post-register allocation optimization includes all the normal optimizations found when using -O3, but the ARMLoadStoreOpt class, which combines single memory ops into multiple memory ops, does not have any passes that run before it to move memory ops into advantageous positions. This provides a more realistic baseline, as any useful optimization should increase the rate of single ops turning into multiple ops, whether the optimization runs before or after the conversion process. In the Tables below, we mark this set of results as “O3 nopreorpost”.

The final two Table columns include one or the other of the optimizations which run before the combining process, and are denoted as “O3 pre_ra” and “O3 MagnetPass” respectively. We wish to especially compare these two columns, as it will allow us to make a direct comparison between the pre- and post-register allocation implementations of the optimization.

To ease testing, we maintained three versions of the LLVM compiler built with various combinations of optimizations enabled or disabled. After generating the assembly files, we used a Python script to extract a list of all the assembly instructions and categorize them to give us an overview of the effectiveness of the optimizations. We then examined the assembly files by hand where the aggregate data indicated an interesting modification to the generated assembly files.
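A categorizing script of this kind might look like the following sketch. It is not the script used for this thesis: the mnemonic prefixes chosen for each category (for example, treating every `ldr*` variant as a single load and ignoring `push`/`pop` aliases) and the rules for skipping labels, directives, and comments are assumptions.

```python
from collections import Counter


def classify(mnemonic):
    """Map an ARM mnemonic to one of the five summary categories."""
    m = mnemonic.lower()
    if m.startswith("ldm"):
        return "LOAD_MULTIPLE"
    if m.startswith("stm"):
        return "STORE_MULTIPLE"
    if m.startswith("ldr"):
        return "LOAD_SINGLE"
    if m.startswith("str"):
        return "STORE_SINGLE"
    return "OTHER"


def summarize(asm_text):
    """Count instructions per category in GNU-style ARM assembly text."""
    counts = Counter()
    for raw in asm_text.splitlines():
        line = raw.split("@", 1)[0].strip()  # '@' starts a comment
        # Skip blank lines, assembler directives / local labels, and labels.
        if not line or line.startswith(".") or line.endswith(":"):
            continue
        counts[classify(line.split()[0])] += 1
    return counts


sample = """
    .text
main:
    ldr r0, [sp, #12]   @ single load
    ldmia r2, {r0, r1}
    str r0, [sp]
    add r0, r0, #1
.LBB0_1:                @ %bb
    bl printf
"""
print(dict(summarize(sample)))
```

Summing the five counters gives the "Total" row of the summary tables.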
### Table 5.1: Summary of instructions in dijkstra_large

<table>
<thead>
<tr>
<th>Instruction Type</th>
<th>O0 none</th>
<th>O3 nopreorpost</th>
<th>O3 pre_ra</th>
<th>O3 MagnetPass</th>
</tr>
</thead>
<tbody>
<tr>
<td>LOAD_SINGLE</td>
<td>160</td>
<td>123</td>
<td>123</td>
<td>123</td>
</tr>
<tr>
<td>LOAD_MULTIPLE</td>
<td>0</td>
<td>3</td>
<td>3</td>
<td>3</td>
</tr>
<tr>
<td>STORE_SINGLE</td>
<td>92</td>
<td>53</td>
<td>53</td>
<td>53</td>
</tr>
<tr>
<td>STORE_MULTIPLE</td>
<td>0</td>
<td>2</td>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>OTHER</td>
<td>149</td>
<td>138</td>
<td>138</td>
<td>138</td>
</tr>
<tr>
<td>Total</td>
<td>401</td>
<td>319</td>
<td>319</td>
<td>319</td>
</tr>
</tbody>
</table>

5.3 Comparative Results

5.3.1 dijkstra

In the case of the *dijkstra* benchmark, neither the pre- nor the post-register allocation optimization had any effect on the size of the code. In fact, the O3 version without the pre-RA optimization is exactly the same as the version with it. We will, however, take this as an opportunity to show the differing heuristic used by the MagnetPass optimization. Listing 5.1 displays the modifications to the ASM performed by the post-RA optimization.

#### Listing 5.1: Dijkstra ASM Movements

```assembly
1      bgt .LBB5_2
2      @ BB#1: @ %bb
3      ldr r4, .LCPI5_0
4      +-
5      | ldr r0, .LCPI5_1
6      | mov r2, #27
7      | mov r1, #1
8      +> ldr r3, [r4]
9      bl fwrite
10      | ldr r0, .LCPI5_2
11      | mov r2, #40
12      | mov r1, #1
13      -> ldr r3, [r4]
14      bl fwrite
15  .LBB5_2: @ %bb1
16      ldr r0, [sp, #20]
17      ... snipped ...
18      orr r0, r1, r0, lsl #16
19      ldr r1, .LCPI5_3
20      bl fopen
21      | str r4, [sp, #12]
22      +> str r0, [sp]
23      b .LBB5_7
24  .LBB5_3: @ %bb2
25      @ in Loop: Header=
26      BB5_7 Depth=1
27      ... snipped ...
28      b .LBB5_5
29  .LBB5_4: @ %bb3
30      @ in Loop: Header=
31      BB5_5 Depth=2
32      | add r2, sp, #4
33      | mov r1, r5
34      +> ldr r0, [sp]
```
In Listing 5.1, we see that the memory ops at lines 4, 10, 22, and 32 are moved downward in the code as far as possible since no suitable cluster point is found. This is performed to reduce the live range of the value held in the registers. However, the density of memory operations is not sufficient to allow for clustering, and Table 5.1 illustrates that there are no additional multiple memory operations.

5.3.2 qsort

In qsort, again we see that the pre optimization has little effect on the ASM code, and in fact merely moves a single store as shown in Listing 5.2.

**Listing 5.2: Qsort Pre ASM Movements**

```
1      ldr r2, [sp, #12]
2      add r2, r2, r2, lsl #2
3      add r2, r6, r2, lsl #2
4      -> str r0, [r2, #12]
5      str r1, [r2, #16]
6      +- ldr r1, [sp, #12]
7      add r0, r1, #1
8      str r0, [sp, #12]
9      ... snipped ...
```
<table>
<thead>
<tr>
<th>Instruction Type</th>
<th>O0 none</th>
<th>O3 nopreorpost</th>
<th>O3 pre_ra</th>
<th>O3 MagnetPass</th>
</tr>
</thead>
<tbody>
<tr>
<td>LOAD_SINGLE</td>
<td>81</td>
<td>50</td>
<td>50</td>
<td>51</td>
</tr>
<tr>
<td>LOAD_MULTIPLE</td>
<td>0</td>
<td>4</td>
<td>4</td>
<td>4</td>
</tr>
<tr>
<td>STORE_SINGLE</td>
<td>56</td>
<td>21</td>
<td>21</td>
<td>21</td>
</tr>
<tr>
<td>STORE_MULTIPLE</td>
<td>0</td>
<td>3</td>
<td>3</td>
<td>3</td>
</tr>
<tr>
<td>OTHER</td>
<td>96</td>
<td>107</td>
<td>107</td>
<td>109</td>
</tr>
<tr>
<td>Total</td>
<td>233</td>
<td>185</td>
<td>185</td>
<td>188</td>
</tr>
</tbody>
</table>

Table 5.2: Summary of instructions in qsort_large

We see that the order of the two stores on lines 4 and 5 has simply been switched. Because the later combining optimization does not care about instruction order, this movement does nothing to assist the combining optimization. Our code includes a condition for this case: if a set of instructions is already contiguous and no other memory ops are present apart from the contiguous block, the instructions are not reordered. This avoids such unnecessary movement of instructions.

Though there are two memory instructions, the later combining optimization does not combine them, and thus the movement has no effect. This is due to the fact that the MergeOps method, which is responsible for the final merging of the contiguous instructions, will not merge only two instructions unless the starting offset is equal to 0. If the starting offset is greater than 0, LLVM would need to add an instruction to create the proper offset in a register, then use that register for the multiple memory instruction base register. Thus, we would be adding an instruction to remove an instruction, and MergeOps simply skips this unprofitable exercise.
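The profitability rule described above can be summarized in a short sketch. This is a simplified model of the behavior as described in this section, not LLVM's MergeOps code, which considers more cases; the function name and its argument convention are hypothetical.

```python
def merge_is_profitable(offsets):
    """offsets: ascending byte offsets of an already contiguous run of
    memory ops that share one base register."""
    if len(offsets) < 2:
        return False  # nothing to merge
    if offsets[0] == 0:
        return True   # the base register can be used directly
    # Nonzero start: one extra instruction is needed to build the new base,
    # so merging only two ops removes one instruction and adds one.
    return len(offsets) > 2


print(merge_is_profitable([12, 16]))      # the two qsort stores at #12 and #16
print(merge_is_profitable([0, 4]))        # two ops starting at offset 0
print(merge_is_profitable([12, 16, 20]))  # three ops, nonzero start
```

Under this model the two stores in Listing 5.2 (offsets 12 and 16) are left alone, while a pair starting at offset 0 is merged.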

Though the MagnetPass optimization moves other memory operations, it does not enable any more multiple memory operations to form, due to the sparseness of memory ops in these code regions (see Listing 5.3).
Listing 5.3: Qsort Post ASM Movements

```
1      str r1, [sp, #12]
2      ldr r0, [lr, #4004]
3      cmp r0, #1
4   *  ble .LBB1_11
5   @ BB#1:  @ %bb1
6      add r6, sp, #73, 18 @ 1196032
7      ldr r0, [r6, #4000]
8      ... snipped ...
9      str r0, [sp, #12]
10  .LBB1_3:  @ %bb3
11  @ =>This Inner Loop Header: Depth=1
12  +-
13  |  add r2, sp, #8
14  |  mov r1, r4
15  +> ldr r0, [sp, #20]
16     bl __isoc99_fscanf
17     cmp r0, #1
18     bne .LBB1_7
19     ... snipped ...
20  @ BB#4:  @ %bb4
21  @ in Loop: Header=BB1_3 Depth=1
22  +-
23  |  add r2, sp, #4
24  |  mov r1, r4
25  +> ldr r0, [sp, #20]
26     bl __isoc99_fscanf
27     cmp r0, #1
28     bne .LBB1_7
29     ... snipped ...
30  @ BB#5:  @ %bb5
31  @ in Loop: Header=BB1_3 Depth=1
32  +-
33  |  mov r2, sp
34  |  mov r1, r4
35  +> ldr r0, [sp, #20]
36     bl __isoc99_fscanf
37     cmp r0, #1
38     bne .LBB1_7
39     ... snipped ...
40     add r4, sp, #24
41     ldr r5, .LCPI1_6
42     bl printf
43  +-
44  |  mov r2, #20
45  |  ldr r3, .LCPI1_5
46  |  mov r0, r4
47  +> ldr r1, [sp, #12]
48     bl qsort
49     mov r0, #0
50     b .LBB1_9
51     ... snipped ...
52     ldr r0, [sp, #12]
53     ldr r1, [sp, #16]
54     cmp r1, r0
```

Besides the previously mentioned loads moving (lines 15, 25, 35, and 47), we must note two things here. First, by moving the loads down past some of the arithmetic ops below them, to just before the branch instructions, we may be reducing performance, since those arithmetic instructions can otherwise act as a buffer of work to be performed while the loads are finishing.

Second, we see that the post-RA MagnetPass optimization has the potential to actually increase code size when optimizing conditional instructions, as seen in the large block of contiguous instructions on lines 55-68, which is modified and has instructions added. This occurs because breaking the conditional instructions apart into a new basic block increases the code size, which is the opposite of our goal. We have noted this problem in the Future Work section at the end of this thesis, but because we are focusing on increasing code density through combining memory instructions, and because conditional blocks are seen more rarely in practice, we have chosen to ignore this issue to keep our algorithm (and C++ code) simpler.
5.3.3 sha

The sha program has both a core module (sha.c) and a driver module (sha_driver.c), which we will examine separately.

The sha.c file is similar to the qsort benchmark in that the pre RA optimization merely moves a single store, but does not assist in the creation of any new multiple memory operations. We have omitted that code listing because it closely resembles the qsort listing and adds nothing to our analysis.

Listing 5.4: Sha ASM Movements

```
1   ldr r3, [r2, #-32]
2   ldr r12, [r2, #-12]
3      eor r3, r12, r3
4   + ldr r12, [r2, #-56]
5   ldr r2, [r2, #-64]
6      * eor r3, r3, r12
7      eor r2, r3, r2
8   str r2, [r1, r0, lsl #2]
9   ldr r1, [sp, #344]
10  ... snipped ...
11  b   .LBB0_8
12  .LBB0_7:                           @ %bb6
13  @   in Loop: Header=
14   BB0_8 Depth=1
15  +>  ldr r1, [sp, #324]
16    |  ldr r0, [sp, #332]
17    +-  ldr r2, [sp, #328]
ldr r3, [sp, #344]
bic r1, r1, r0
... snipped ...
b .LBB0_11
.LBB0_10:
@ %bb10
@ in Loop: Header=
BB0_11 Depth=1
+-
mov r1, sp
+- ldr r0, [sp, #344]
+- ldr r2, [sp, #332]
+- ldr r3, [sp, #328]
+-
ldr r0, [r1, r0, lsl #2]
eor r1, r2, r3
ldr r2, [sp, #324]
... snipped ...
b .LBB0_17
.LBB0_16:
@ %bb16
@ in Loop: Header=
BB0_17 Depth=1
+-
mov r1, sp
+- ldr r0, [sp, #344]
+- ldr r2, [sp, #332]
+- ldr r3, [sp, #328]
+-
ldr r0, [r1, r0, lsl #2]
eor r1, r2, r3
```

In Listing 5.4, the clustering algorithm of the MagnetPass optimization is clearly seen, as in several places groups of loads are brought together to form clusters of sufficient size to be combined by the later optimization. Unfortunately, the clusters are not combined because they do not all use contiguous memory locations. For example, the load instructions at lines 39-41 use memory addresses sp + 328, sp + 332, and sp + 344, which are not all 4 bytes apart (336 and 340 are missing), and thus the three instructions cannot be combined by the later optimization. While sp + 328 and sp + 332 could be combined, as mentioned above MergeOps does not combine them because they do not have an initial base register offset of 0.
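The contiguity check in this example can be worked through explicitly. The sketch below groups the offsets of a cluster into runs whose members are exactly one word (4 bytes) apart; the helper name is hypothetical, and the grouping is only an illustration of the reasoning above, not the logic of the combining optimization itself.

```python
def contiguous_runs(offsets, word=4):
    """Split byte offsets (sharing one base register) into runs whose
    consecutive members are exactly `word` bytes apart."""
    runs = []
    for off in sorted(offsets):
        if runs and off - runs[-1][-1] == word:
            runs[-1].append(off)   # extends the current run
        else:
            runs.append([off])     # gap: start a new run
    return runs


# Offsets from sp used by the three clustered loads discussed above.
runs = contiguous_runs([344, 332, 328])
print(runs)
```

The three loads split into a two-element run starting at offset 328 and a lone load at offset 344, so at most the pair could be merged, and the pair is then rejected because its starting offset is not 0.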

There are some other peculiarities to note with this code listing. Near the top, at line 4, we see that a `ldr` appears to be added, but checking the table of instruction counts, we see that this is not the case. In fact, the `ldr` instruction used to appear above the first `eor` instruction, but was moved below it. In the process, the `ldr` register was changed from its original register of `r4` to `r12`, and thus was treated as a "new" instruction that replaced the "old" `ldr` instruction. This also explains why the second `eor` instruction below has been changed; rather than using the original `r4` register it is using the `r12` register, and the text diff program we are using reports this as a change, and marks it with an asterisk.

More interesting is the second to last change (lines 50 and 53), where it appears that two instructions have been added. In fact, the top `sub` instruction previously was a `str` instruction which was moved down (to the second "addition" point). The `str` instruction previously was an instruction that simultaneously stored a value and updated the `sp` pointer by subtracting 4 from it. Because the `str` instruction moved below a use of the `sp` value, the instruction was "split" into an initial subtraction from the `sp` pointer and then a later store. This is where the new `sub` instruction comes from, and the reason why there is an extra “other” instruction listed for the MagnetPass optimization in the Summary Table. Because this move was not useful enough to cause a new multiple memory instruction to be created, it would be better for the original `str` instruction to simply not move at all, so that the program would not incur the code size penalty of an unnecessary new instruction. Unfortunately, our algorithm does not allow us to make this determination (indeed, the original pre RA optimization also does not know whether a change will cause a combination; both pre and post use heuristics).

Table 5.4: Summary of instructions in sha_driver

<table>
<thead>
<tr>
<th>Instruction Type</th>
<th>O0 none</th>
<th>O3 nopreorpost</th>
<th>O3 pre_ra</th>
<th>O3 MagnetPass</th>
</tr>
</thead>
<tbody>
<tr>
<td>LOAD_SINGLE</td>
<td>18</td>
<td>13</td>
<td>13</td>
<td>13</td>
</tr>
<tr>
<td>LOAD_MULTIPLE</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>STORE_SINGLE</td>
<td>17</td>
<td>8</td>
<td>8</td>
<td>8</td>
</tr>
<tr>
<td>STORE_MULTIPLE</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>OTHER</td>
<td>29</td>
<td>30</td>
<td>30</td>
<td>30</td>
</tr>
<tr>
<td>Total</td>
<td>64</td>
<td>53</td>
<td>53</td>
<td>53</td>
</tr>
</tbody>
</table>

The sha_driver.c file creates a very small ASM file, and in fact the output with the pre RA optimization is identical to the output without it. Our post-RA MagnetPass optimization moves a single `ldr` instruction down, but does not lead to any new combinations. We have omitted both code listings because neither adds to this analysis.

5.3.4 stringsearch

While we have seen clustering in previous benchmarks, until this point we have not seen any actual combining of memory ops into a multiple memory op. The bmhasrch.c file requires one of these two optimizations in order to combine one section of code into such an op. The section of code is shown in Listing 5.5.

<table>
<thead>
<tr>
<th>Instruction Type</th>
<th>O0 none</th>
<th>O3 nopreorpost</th>
<th>O3 pre_ra</th>
<th>O3 MagnetPass</th>
</tr>
</thead>
<tbody>
<tr>
<td>LOAD_SINGLE</td>
<td>88</td>
<td>67</td>
<td>65</td>
<td>65</td>
</tr>
<tr>
<td>LOAD_MULTIPLE</td>
<td>0</td>
<td>2</td>
<td>3</td>
<td>3</td>
</tr>
<tr>
<td>STORE_SINGLE</td>
<td>35</td>
<td>24</td>
<td>24</td>
<td>24</td>
</tr>
<tr>
<td>STORE_MULTIPLE</td>
<td>0</td>
<td>2</td>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>OTHER</td>
<td>78</td>
<td>63</td>
<td>63</td>
<td>63</td>
</tr>
<tr>
<td>Total</td>
<td>201</td>
<td>158</td>
<td>157</td>
<td>157</td>
</tr>
</tbody>
</table>

Table 5.5: Summary of instructions in bmhasrch

As seen here, the original O3 compilation yields two ldr instructions separated by another ldr instruction with a different base. The pre RA optimization moves the lower ldr up, while our MagnetPass optimization moves the upper ldr down, and in both cases they are combined by the later optimization. Because the pre RA optimization occurs before register allocation, the registers used for the resulting ldmia instruction are different, and subsequent code reflects the different registers chosen for these values. Because the post-RA optimization occurs after the registers have been chosen, the register numbers are unchanged.

Table 5.6: Summary of instructions in bmhisrch

<table>
<thead>
<tr>
<th>Instruction Type</th>
<th>O0 none</th>
<th>O3 nopreorpost</th>
<th>O3 pre_ra</th>
<th>O3 MagnetPass</th>
</tr>
</thead>
<tbody>
<tr>
<td>LOAD_SINGLE</td>
<td>103</td>
<td>69</td>
<td>67</td>
<td>67</td>
</tr>
<tr>
<td>LOAD_MULTIPLE</td>
<td>0</td>
<td>4</td>
<td>5</td>
<td>5</td>
</tr>
<tr>
<td>STORE_SINGLE</td>
<td>51</td>
<td>32</td>
<td>32</td>
<td>32</td>
</tr>
<tr>
<td>STORE_MULTIPLE</td>
<td>0</td>
<td>2</td>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>OTHER</td>
<td>96</td>
<td>83</td>
<td>83</td>
<td>83</td>
</tr>
<tr>
<td>Total</td>
<td>250</td>
<td>190</td>
<td>189</td>
<td>189</td>
</tr>
</tbody>
</table>

This listing also illustrates that two contiguous instructions will be combined if the first base register offset is 0, as seen in line 4. While previously we have seen two memory instructions being skipped over, here there is no need to add an instruction to incorporate the starting offset into a base register, and it is thus profitable to combine the two instructions into a single multiple load instruction.

Our optimization also moves some other ldr instructions down in the ASM file, however no further combinations are performed, so we omit them here.

The bmhisrch.c file is very similar to the bmhasrch.c file above, including a single instance where both the pre- and post-RA optimizations enabled a combining of two ldr instructions into a single ldmia instruction. There were also some instances of single ldr instructions moving down, as usual, but no additional combinations were performed.

Table 5.7: Summary of instructions in bmhsrch

<table>
<thead>
<tr>
<th>Instruction Type</th>
<th>O0 none</th>
<th>O3 nopreorpost</th>
<th>O3 pre_ra</th>
<th>O3 MagnetPass</th>
</tr>
</thead>
<tbody>
<tr>
<td>LOAD_SINGLE</td>
<td>74</td>
<td>53</td>
<td>51</td>
<td>51</td>
</tr>
<tr>
<td>LOAD_MULTIPLE</td>
<td>0</td>
<td>2</td>
<td>3</td>
<td>3</td>
</tr>
<tr>
<td>STORE_SINGLE</td>
<td>36</td>
<td>26</td>
<td>26</td>
<td>26</td>
</tr>
<tr>
<td>STORE_MULTIPLE</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>OTHER</td>
<td>71</td>
<td>60</td>
<td>60</td>
<td>60</td>
</tr>
<tr>
<td>Total</td>
<td>181</td>
<td>142</td>
<td>141</td>
<td>141</td>
</tr>
</tbody>
</table>

Table 5.8: Summary of instructions in pbmsrch_large

<table>
<thead>
<tr>
<th>Instruction Type</th>
<th>O0 none</th>
<th>O3 nopreorpost</th>
<th>O3 pre_ra</th>
<th>O3 MagnetPass</th>
</tr>
</thead>
<tbody>
<tr>
<td>LOAD_SINGLE</td>
<td>78</td>
<td>57</td>
<td>57</td>
<td>58</td>
</tr>
<tr>
<td>LOAD_MULTIPLE</td>
<td>0</td>
<td>3</td>
<td>3</td>
<td>3</td>
</tr>
<tr>
<td>STORE_SINGLE</td>
<td>45</td>
<td>26</td>
<td>26</td>
<td>26</td>
</tr>
<tr>
<td>STORE_MULTIPLE</td>
<td>0</td>
<td>2</td>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>OTHER</td>
<td>93</td>
<td>78</td>
<td>78</td>
<td>80</td>
</tr>
<tr>
<td>Total</td>
<td>216</td>
<td>166</td>
<td>166</td>
<td>169</td>
</tr>
</tbody>
</table>

The bmhsrch.c file is also very similar to the bmhasrch.c file above, including a single instance where both the pre- and post-RA optimizations enabled a combining of two ldr instructions into a single ldmia instruction. There were also some instances of single ldr instructions moving down, as usual, but no additional combinations were performed.

In the pbmsrch_large.c file, there are no modifications made by the pre RA optimization, so the two ASM files are identical. This file contains a rather substantial section of conditional instructions, and as the MagnetPass optimization does not handle this well, we actually add an entire basic block and some extra instructions in place of the logic flow that used to pass through a set of conditional instructions.
This is unfortunate, and in instances like these with conditional sections of code, our algorithm would be advised to simply skip over these sections rather than make the modifications. Due to the length and complexity of the conditional section, and given that this is not an area this optimization is focusing its attention on, we have opted not to include an ASM code listing for this file.

5.4 Effect on Compilation Time

In addition to checking correctness and that our algorithm is working as intended, we also wish to compare the time it takes to execute our new pass with the time expended by the old pass. In Table 5.9, we show the comparative analysis. For each file, we used LLVM’s `--time-passes` parameter to measure the total compilation time used by LLVM in building the assembly (*.s) files. We built each file five times and measured and compared the user+system time, rather than the wall-clock time. We chose to measure total compile time rather than time the individual pass, because the benchmark files we are using are relatively small and the compile time for each is very short. The `--time-passes` option reports with a minimum granularity on the order of a few hundred microseconds. Because our total compilation times are on the order of tens of milliseconds, the shorter pass-specific times are less accurate for calculating a percent difference.

We see in Table 5.9 that compilation with our new pass takes on average 35% longer than with the existing two phase optimization. Much of this is likely due to our algorithm requiring a pair of passes through the basic block to regenerate our range of movement, and the fact that we must perform this pair of passes before every set of memory ops that must be clustered together. Thus, the more sets of memory ops with differing base registers there are in a basic block, the more passes must occur. This relationship is linear in the number of clusters, and is affected not only by the absolute number of memory ops in a basic block, but also by the number of different base registers these ops use. By contrast, the current pre-register allocation optimization only iterates forward through the basic block, and is thus less affected by these same two properties.
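The percent differences reported in Table 5.9 follow directly from the measured times. The short sketch below reproduces the arithmetic using the times from that table; the variable and function names are our own, and the average is taken over the per-file percentages.

```python
# (pre-RA time, MagnetPass time) in seconds, as listed in Table 5.9
times = {
    "dijkstra": (0.0960, 0.1368),
    "qsort": (0.0648, 0.0888),
    "sha_driver": (0.0208, 0.0224),
    "sha": (0.1184, 0.1936),
    "pbmsrch_large": (0.0536, 0.0712),
    "bmhisrch": (0.0616, 0.0808),
    "bmhsrch": (0.0432, 0.0584),
    "bmhasrch": (0.0520, 0.0656),
}


def percent_diff(pre, post):
    """Relative increase of `post` over `pre`, in percent."""
    return (post - pre) / pre * 100.0


diffs = {name: percent_diff(pre, post) for name, (pre, post) in times.items()}
average = sum(diffs.values()) / len(diffs)

for name, diff in diffs.items():
    print(f"{name:15s}{diff:7.2f}")
print(f"{'Average':15s}{average:7.2f}")
```

The spread between files is wide (roughly 8% for sha_driver and 64% for sha), so the 35% average should be read as a rough indicator rather than a tight estimate.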

5.5 Analysis

These four benchmark examples show that in most cases the MagnetPass post-register allocation optimization we have developed provides similar improvements to code size as the existing optimization that runs before register allocation. In the dijkstra and qsort benchmarks, we see that the optimization reduces the live ranges of register values by moving loads down if no clustering is possible. In the sha benchmark, we see clustering occur in probabilistically advantageous ways, moving distant memory ops into contiguous blocks, which increases the probability that the later combining optimization will combine them. Unfortunately, due to the lower density of single memory instructions in these benchmarks, we do not see any clustering performed that allows the ARMLoadStoreOpt optimization to generate block memory instructions. However, in the stringsearch benchmark, we see this clustering actually result in a combination of two memory ops, reducing code size in a way similar to the pre-register allocation optimization, but with the added benefit of moving loads down and thus (slightly) reducing the live range of the variable and reducing register pressure.

<table>
<thead>
<tr>
<th>Filename</th>
<th>Pre_RA</th>
<th>MagnetPass</th>
<th>% diff</th>
</tr>
</thead>
<tbody>
<tr>
<td>dijkstra</td>
<td>0.0960</td>
<td>0.1368</td>
<td>42.50</td>
</tr>
<tr>
<td>qsort</td>
<td>0.0648</td>
<td>0.0888</td>
<td>37.04</td>
</tr>
<tr>
<td>sha_driver</td>
<td>0.0208</td>
<td>0.0224</td>
<td>7.69</td>
</tr>
<tr>
<td>sha</td>
<td>0.1184</td>
<td>0.1936</td>
<td>63.51</td>
</tr>
<tr>
<td>pbmsrch_large</td>
<td>0.0536</td>
<td>0.0712</td>
<td>32.84</td>
</tr>
<tr>
<td>bmhisrch</td>
<td>0.0616</td>
<td>0.0808</td>
<td>31.17</td>
</tr>
<tr>
<td>bmhsrch</td>
<td>0.0432</td>
<td>0.0584</td>
<td>35.19</td>
</tr>
<tr>
<td>bmhasrch</td>
<td>0.0520</td>
<td>0.0656</td>
<td>26.15</td>
</tr>
<tr>
<td>Average</td>
<td>0.0638</td>
<td>0.0897</td>
<td>34.51</td>
</tr>
</tbody>
</table>

Table 5.9: Compilation time (in seconds) comparison

We must also note some of the disadvantages to this approach. Many of these are simply due to the relative maturities of the two code bases, and our reluctance to over-complicate the algorithm by accounting for certain edge cases. In both qsort and stringsearch, we see that the post-register allocation optimization does not currently gracefully handle conditional code blocks, and can thus increase code size if a memory op within a conditional block is moved. Also, in the sha benchmark, we see an instance where our optimization’s unawareness of certain micro-optimizations causes code expansion by moving a memory op away from an arithmetic op that affects the memory op’s base register. A more robust version of our optimization would have to account for these issues in its heuristic algorithm.

More broadly, we must also compare the two optimizations by asking which point in compilation provides more information to assist in moving memory ops for the purposes of combining them. Because the LLVM bytecode model is an infinite register machine (due to this intermediate representation using SSA), any target machine with a fixed set of registers (such as ARM) will introduce the possibility of aliasing or spilling a value, particularly in situations with increased register pressure. Thus, it is generally more advantageous to perform this sort of optimization prior to register allocation, as discrete values are more easily identified and tracked when these two issues are not a concern. In situations with higher register pressure, we expect the pre-register allocation algorithm to outperform our post-register allocation optimization for this reason. Some of these advantages are reduced somewhat by the set of functions and properties LLVM makes available for this purpose, such as the kill flag marking the last use of a register for a particular value. Further study would be necessary to determine the relative efficiencies of these two domains and their respective functions.

Our analysis thus indicates that while it is possible to create a very serviceable post-register allocation version of the existing optimization (as we have made the beginnings of here), there exist a number of advantages to implementing this optimization before register allocation has occurred. In addition, our heuristic of blindly moving all memory ops to reduce their live ranges may lead to more disadvantages due to missed later optimizations than advantages due to reduced register pressure.
Chapter 6

Summary

In this thesis, we examined a pre-register allocation memory optimization in LLVM and attempted to recreate it as a post-register allocation optimization (our MagnetPass) and compare the results. We examined what must be present for such an optimization to work, and developed an algorithm to generate a data structure containing safe ranges to move each memory op. We then implemented this using C++ in the LLVM compiler and tested it against several benchmarks from the MiBench benchmark suite to prove its operation. Finally, we compared our optimization against the existing pre-register allocation optimization both using the raw data from the benchmark compilation and using reasoning about which would likely have access to better data more quickly. We found that our algorithm is on average 35% slower than the existing implementation, but provides comparable results outside of a few edge cases.

Our optimization has several clear deficiencies, many of which are noted in the Future Work section below. However, the optimization does correctly identify memory ops that may be safely moved into contiguous blocks and successfully moves them whilst updating the basic block instructions if necessary. With additional effort, most of the disadvantages noted in Chapter 5 could be overcome and the optimization would likely produce code sizes of similar quality to the existing pre-register allocation optimization. However, while the code generated would be of similar quality, because the information required for this algorithm is more easily accessed before transforming the LLVM intermediate representation in SSA form to the target ARM ISA, the pre-register allocation optimization would almost certainly be more efficient in terms of time taken to execute. Because of this, and because of the existing maturity of the current implementation for this optimization, it is unlikely that development on our optimization will continue or that it will be submitted to the LLVM project for consideration or improvement.

### 6.1 Future Work

As mentioned in Chapter 3, our current implementation does not handle pointer aliasing for base pointers. Instead, it conservatively ends all stores when a load is detected, and vice versa. It would be desirable to improve the flexibility of this optimization pass by implementing handling of pointer aliasing.
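The conservative behaviour described above can be illustrated with a minimal sketch. It is not the MagnetPass code: `MemOp`, `MemKind`, and `splitConservatively` are hypothetical names, and the real pass operates on LLVM machine instructions rather than on this simplified record. The sketch only shows the idea of ending the current group of loads (or stores) as soon as an op of the other kind is seen, without asking whether the two actually alias.

```cpp
#include <vector>

// Hypothetical, simplified stand-in for a memory instruction; the real pass
// works on LLVM machine instructions, which are not modeled here.
enum class MemKind { Load, Store };

struct MemOp {
    MemKind kind;  // load or store
    int baseReg;   // base pointer register number
    int offset;    // byte offset from the base pointer
};

// Splits a basic block's memory ops into runs of the same kind. A run ends as
// soon as an op of the other kind appears, so no load is ever grouped across
// a store (or vice versa), whether or not the two accesses really alias.
std::vector<std::vector<MemOp>> splitConservatively(const std::vector<MemOp>& ops) {
    std::vector<std::vector<MemOp>> runs;
    for (const MemOp& op : ops) {
        if (runs.empty() || runs.back().back().kind != op.kind) {
            runs.emplace_back();  // kind changed: start a new run
        }
        runs.back().push_back(op);
    }
    return runs;
}
```

Under this scheme a single store between two loads from the same base pointer produces three separate runs. An alias-aware version could keep the loads in one run whenever the store is known not to overlap them, which is the flexibility this item of future work is after.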

Currently, this implementation does not attempt to track changes to base pointers; if the base pointer changes, we do not adjust any of the subsequent offsets that use that base pointer in order to widen the range of movement. Implementing this could allow for a wider range of movement, and thus more efficient collection of related memory operations.
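One possible shape for such tracking is sketched below, assuming that the only changes to the base pointer are additions of a known constant. The `Instr` record and the `normalizedOffsets` function are hypothetical and exist only for illustration; they are not part of the implementation described in this thesis.

```cpp
#include <vector>

// Hypothetical simplified instruction: either a memory access at
// [base + value] or a constant adjustment of the base register (base += value).
struct Instr {
    bool isBaseUpdate;  // true: base += value; false: access at [base + value]
    int value;
};

// Returns each access's offset relative to the value the base pointer held at
// the start of the sequence, by accumulating the constant updates seen so far.
std::vector<int> normalizedOffsets(const std::vector<Instr>& seq) {
    std::vector<int> offsets;
    int delta = 0;  // net change applied to the base pointer so far
    for (const Instr& in : seq) {
        if (in.isBaseUpdate) {
            delta += in.value;
        } else {
            offsets.push_back(in.value + delta);
        }
    }
    return offsets;
}
```

With offsets expressed against a common starting value, accesses on either side of a base pointer update can be compared for contiguity. A non-constant update would still have to end the tracked range.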

Our optimization also does not work with conditional instructions, and in fact increases the code size by removing the conditional instructions and inserting a basic block. This does not really affect the main comparison made by this thesis, but if this were to be turned into a real compiler optimization to be submitted to the LLVM project, support for conditional instructions would have to be added.

Bibliography

[1] The palm pre vs. apple’s iphone 3g: Preliminary results and 3gs discussion.  


[8] What is the fastest way to copy memory on a cortex a8? http:


[31] M. Poletto and V. Sarkar. Linear scan register allocation.  


[34] D. Shin and J. Kim. An operation rearrangement technique for low-power vliw instruction fetch.  


[38] C. Wimmer and M. Franz. Linear scan register allocation on ssa form. In
Proceedings of the 8th annual IEEE/ACM international symposium on Code
Appendix A

MiBench Benchmark Information

### A.1 Introduction

The benchmark used for this thesis was MiBench version 1.0 [22] which originated from the Electrical Engineering and Computer Science Department at the University of Michigan, Ann Arbor. All benchmarks were accessed from http://www.eecs.umich.edu/mibench/index.html on 10/2/2010. Output was verified with sample output downloaded from the same site.

All benchmarks were compiled with the -mthumb switch using the LLVM frontend Clang on an ARM Cortex-A8 microprocessor running on a BeagleBoard. The full results of the attempted compilation, using Clang running on Ubuntu 10.04.1 LTS, are listed below.

### A.2 Compilation

<table>
<thead>
<tr>
<th>Name</th>
<th>In Results</th>
<th>Compiled</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>basicmath</td>
<td>no</td>
<td>yes</td>
<td>Outputs differed due to FP rounding</td>
</tr>
<tr>
<td>bitcount</td>
<td>no</td>
<td>yes</td>
<td>None</td>
</tr>
<tr>
<td>qsort</td>
<td>yes</td>
<td>yes</td>
<td>None</td>
</tr>
<tr>
<td>susan</td>
<td>no</td>
<td>yes</td>
<td>None</td>
</tr>
<tr>
<td>jpeg</td>
<td>no</td>
<td>yes</td>
<td>None</td>
</tr>
<tr>
<td>lame</td>
<td>no</td>
<td>no</td>
<td>termcap.h: No such file or directory</td>
</tr>
<tr>
<td>mad</td>
<td>no</td>
<td>no</td>
<td>unrecognized command line option &quot;-fforce-mem&quot;</td>
</tr>
<tr>
<td>tiff2bw</td>
<td>no</td>
<td>no</td>
<td>Test for tiff-v3.5.4 only; no compilation</td>
</tr>
<tr>
<td>tiff2rgba</td>
<td>no</td>
<td>no</td>
<td>Test for tiff-v3.5.4 only; no compilation</td>
</tr>
<tr>
<td>tiff-data</td>
<td>no</td>
<td>no</td>
<td>Data for tiff-v3.5.4 only; no compilation</td>
</tr>
<tr>
<td>tiffdither</td>
<td>no</td>
<td>no</td>
<td>Test for tiff-v3.5.4 only; no compilation</td>
</tr>
<tr>
<td>tiffmedian</td>
<td>no</td>
<td>no</td>
<td>Test for tiff-v3.5.4 only; no compilation</td>
</tr>
<tr>
<td>tiff-v3.5.4</td>
<td>no</td>
<td>no</td>
<td>tif_luv.c: undefined reference to 'log'</td>
</tr>
<tr>
<td>typeset</td>
<td>no</td>
<td>yes</td>
<td>None</td>
</tr>
<tr>
<td>dijkstra</td>
<td>yes</td>
<td>yes</td>
<td>None</td>
</tr>
<tr>
<td>patricia</td>
<td>no</td>
<td>yes</td>
<td>Outputs differed, but that may be normal</td>
</tr>
<tr>
<td>ghostscript</td>
<td>no</td>
<td>no</td>
<td>macro &quot;dprintf&quot; passed 3 arguments, but takes just 1</td>
</tr>
<tr>
<td>ispell</td>
<td>no</td>
<td>no</td>
<td>correct.c: conflicting types for `getline'</td>
</tr>
<tr>
<td>rsynth</td>
<td>no</td>
<td>no</td>
<td>Could not configure. Need to specify system type.</td>
</tr>
<tr>
<td>sphinx</td>
<td>no</td>
<td>no</td>
<td>blk_cdcn_norm.c: invalid storage class for function `block_actual_cdcn_norm'</td>
</tr>
<tr>
<td>stringsearch</td>
<td>yes</td>
<td>yes</td>
<td>None</td>
</tr>
<tr>
<td>blowfish</td>
<td>no</td>
<td>no</td>
<td>Seems to compile correctly, but segfaults when run.</td>
</tr>
<tr>
<td>pgp</td>
<td>no</td>
<td>no</td>
<td>make reported “nothing to be done for ‘all’.” during first compile attempt.</td>
</tr>
<tr>
<td>rijndael</td>
<td>no</td>
<td>no</td>
<td>aexam.c: aggregate value used where an integer was expected</td>
</tr>
<tr>
<td>sha</td>
<td>yes</td>
<td>yes</td>
<td>None</td>
</tr>
<tr>
<td>adpcm</td>
<td>no</td>
<td>yes</td>
<td>None</td>
</tr>
<tr>
<td>CRC32</td>
<td>no</td>
<td>yes</td>
<td>None</td>
</tr>
<tr>
<td>FFT</td>
<td>no</td>
<td>yes</td>
<td>Outputs differed, but that may be normal.</td>
</tr>
<tr>
<td>gsm</td>
<td>no</td>
<td>yes</td>
<td>None</td>
</tr>
</tbody>
</table>

Table A.1: Compilation results of benchmarks
Appendix B

Full Test Results

### B.1 Introduction

We include here the full dataset generated via compilation during the test of our optimization. This raw data is broken down by basic block, and includes the counts of the different categories of ARM instruction relevant to this analysis. The categories we include are single load instructions (ex: ldr), multiple load instructions (ex: ldmia), single stores (ex: str), multiple stores (ex: stmia), and all other instructions. Each table analyzes a different source file used to compare the optimizations, and looks at the resultant assembly code when compiled without any memory optimizations (denoted “O0 none”), without moving instructions before combining into multiple memory instructions (“O3 nopreorpost”), using the default pre-register allocation optimization (“O3 pre_ra”), and using our new post-register allocation optimization (“O3 MagPs”). Each cell lists the four counts in that order. Total counts are also given to make comparison easier across files.
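As an illustration of how instructions can be sorted into these five categories, the sketch below classifies ARM mnemonics by prefix and tallies them. The prefix rule is an assumption made for this example (it treats anything starting with `ldm`/`stm` as a multiple access and anything starting with `ldr`/`str` as a single access); it is not necessarily how the counts in this appendix were produced, and the names `classify` and `countCategories` are hypothetical.

```cpp
#include <array>
#include <string>
#include <vector>

// The five instruction categories used in the tables of this appendix.
enum Category {
    LOAD_SINGLE,
    LOAD_MULTIPLE,
    STORE_SINGLE,
    STORE_MULTIPLE,
    OTHER,
    NUM_CATEGORIES
};

// Assumption: classification looks only at the lowercase mnemonic prefix.
Category classify(const std::string& mnemonic) {
    if (mnemonic.compare(0, 3, "ldm") == 0) return LOAD_MULTIPLE;
    if (mnemonic.compare(0, 3, "ldr") == 0) return LOAD_SINGLE;
    if (mnemonic.compare(0, 3, "stm") == 0) return STORE_MULTIPLE;
    if (mnemonic.compare(0, 3, "str") == 0) return STORE_SINGLE;
    return OTHER;
}

// Tallies the mnemonics of one basic block; the result is indexed by Category.
std::array<int, NUM_CATEGORIES> countCategories(const std::vector<std::string>& mnemonics) {
    std::array<int, NUM_CATEGORIES> counts{};  // zero-initialized
    for (const std::string& m : mnemonics) {
        ++counts[classify(m)];
    }
    return counts;
}
```

Running `countCategories` once per basic block and per compiler configuration would yield one cell group of the tables that follow; summing the per-block arrays gives the totals row.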

### B.2 Data

Each Table corresponds to a single source file (ex: dijkstra_large.c); some benchmarks use multiple source files, such as the sha benchmark which uses sha.c and sha_driver.c.
<table>
<thead>
<tr>
<th>O0 none / O3 nopreorpost / O3 pre_ra / O3 MagPs</th>
<th>LOAD_SINGLE</th>
<th>LOAD_MULTIPLE</th>
<th>STORE_SINGLE</th>
<th>STORE_MULTIPLE</th>
<th>OTHER</th>
<th>TOTAL</th>
</tr>
</thead>
<tbody>
<tr>
<td>BB 1 / BB 1 / BB 1 / BB 1</td>
<td>2 / 10 / 10 / 10</td>
<td>0 / 0 / 0 / 0</td>
<td>5 / 3 / 3 / 3</td>
<td>0 / 0 / 0 / 0</td>
<td>7 / 11 / 11 / 11</td>
<td>14 / 24 / 24 / 24</td>
</tr>
<tr>
<td>BB 2 / BB 2 / BB 2 / BB 2</td>
<td>3 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 1 / 1 / 1</td>
<td>5 / 1 / 1 / 1</td>
<td></td>
</tr>
<tr>
<td>BB 3 / BB 3 / BB 3 / BB 3</td>
<td>4 / 3 / 3 / 3</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 6 / 6 / 6</td>
<td>0 / 0 / 0 / 0</td>
<td>5 / 1 / 1 / 1</td>
<td>8 / 14 / 14 / 14</td>
</tr>
<tr>
<td>BB 5 / BB 5 / BB 5 / BB 5</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 3 / 3 / 3</td>
<td></td>
</tr>
<tr>
<td>BB 6 / BB 6 / BB 6 / BB 6</td>
<td>6 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>12 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>6 / 2 / 2 / 2 24 / 3 / 3 / 3</td>
<td>11 / 4 / 4 / 4</td>
</tr>
<tr>
<td>BB 7 / BB 7 / BB 7 / BB 7</td>
<td>5 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>4 / 2 / 2 / 2</td>
<td></td>
</tr>
<tr>
<td>BB 8 / BB 8 / BB 8 / BB 8</td>
<td>8 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>4 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 0 / 0 / 0</td>
<td></td>
</tr>
<tr>
<td>BB 9 / BB 9 / BB 9 / BB 9</td>
<td>2 / 3 / 3 / 3</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 3 / 3 / 3</td>
<td>4 / 11 / 11 / 11</td>
</tr>
<tr>
<td>BB 10 / BB 10 / BB 10 / BB 10</td>
<td>0 / 3 / 3 / 3</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 5 / 5 / 5</td>
<td>1 / 8 / 8 / 8</td>
<td></td>
</tr>
<tr>
<td>BB 11 / BB 11 / BB 11 / BB 11</td>
<td>2 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 1 / 1 / 1</td>
<td></td>
</tr>
<tr>
<td>BB 12 / BB 12 / BB 12 / BB 12</td>
<td>2 / 3 / 3 / 3</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 5 / 5 / 5</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 3 / 3 / 3</td>
<td>4 / 11 / 11 / 11</td>
</tr>
<tr>
<td>BB 13 / BB 13 / BB 13 / BB 13</td>
<td>2 / 14 / 14 / 14</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 5 / 5 / 5</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 2 / 2 / 2</td>
<td>3 / 21 / 21 / 21</td>
</tr>
<tr>
<td>BB 14 / BB 14 / BB 14 / BB 14</td>
<td>2 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 2 / 2 / 2</td>
<td>4 / 3 / 3 / 3</td>
</tr>
<tr>
<td>BB 15 / BB 15 / BB 15 / BB 15</td>
<td>1 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 1 / 1 / 1</td>
<td>4 / 1 / 1 / 1</td>
</tr>
<tr>
<td>BB 16 / BB 16 / BB 16 / BB 16</td>
<td>0 / 3 / 3 / 3</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 3 / 3 / 3</td>
<td>1 / 8 / 8 / 8</td>
</tr>
<tr>
<td>BB 17 / BB 17 / BB 17 / BB 17</td>
<td>3 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>8 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>4 / 1 / 1 / 1</td>
<td>15 / 1 / 1 / 1</td>
</tr>
<tr>
<td>BB 18 / BB 18 / BB 18 / BB 18</td>
<td>1 / 5 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>6 / 3 / 3 / 3</td>
<td>0 / 1 / 1 / 1</td>
<td>2 / 5 / 5 / 5</td>
<td>23 / 11 / 11 / 11</td>
</tr>
<tr>
<td>BB 19 / BB 19 / BB 19 / BB 19</td>
<td>1 / 3 / 3 / 3</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 3 / 3 / 3</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 2 / 2 / 2</td>
<td>4 / 8 / 8 / 8</td>
</tr>
<tr>
<td>BB 20 / BB 20 / BB 20 / BB 20</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 2 / 2 / 2</td>
<td>1 / 3 / 3 / 3</td>
</tr>
<tr>
<td>BB 21 / BB 21 / BB 21 / BB 21</td>
<td>2 / 0 / 0 / 0</td>
<td>0 / 1 / 1 / 1</td>
<td>2 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 2 / 2 / 2</td>
<td>5 / 3 / 3 / 3</td>
</tr>
<tr>
<td>BB 22 / BB 22 / BB 22 / BB 22</td>
<td>1 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 2 / 2 / 2</td>
<td>3 / 3 / 3 / 3</td>
</tr>
<tr>
<td>BB 23 / BB 23 / BB 23 / BB 23</td>
<td>0 / 8 / 8 / 8</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 7 / 7 / 7</td>
<td>1 / 17 / 17 / 17</td>
</tr>
<tr>
<td>BB 24 / BB 24 / BB 24 / BB 24</td>
<td>1 / 1 / 1 / 1</td>
<td>7 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>4 / 5 / 5 / 5</td>
<td>12 / 6 / 6 / 6</td>
</tr>
<tr>
<td>BB 25 / BB 25 / BB 25 / BB 25</td>
<td>5 / 3 / 3 / 3</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>4 / 3 / 3 / 3</td>
<td>12 / 7 / 7 / 7</td>
</tr>
<tr>
<td>BB 26 / BB 26 / BB 26 / BB 26</td>
<td>2 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 2 / 2 / 2</td>
<td>4 / 4 / 4 / 4</td>
</tr>
<tr>
<td>BB 27 / BB 27 / BB 27 / BB 27</td>
<td>2 / 4 / 4 / 4</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 3 / 3 / 3</td>
<td>4 / 7 / 7 / 7</td>
</tr>
<tr>
<td>BB 28 / BB 28 / BB 28 / BB 28</td>
<td>1 / 9 / 9 / 9</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 4 / 4 / 4</td>
<td>4 / 15 / 15 / 15</td>
</tr>
<tr>
<td>BB 29 / BB 29 / BB 29 / BB 29</td>
<td>4 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>6 / 1 / 1 / 1</td>
<td>12 / 2 / 2 / 2</td>
</tr>
<tr>
<td>BB 30 / BB 30 / BB 30 / BB 30</td>
<td>4 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 2 / 2 / 2</td>
<td>8 / 4 / 4 / 4</td>
</tr>
<tr>
<td>BB 31 / BB 31 / BB 31 / BB 31</td>
<td>7 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>6 / 3 / 3 / 3</td>
<td>14 / 3 / 3 / 3</td>
</tr>
<tr>
<td>BB 32 / BB 32 / BB 32 / BB 32</td>
<td>4 / 5 / 5 / 5</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>4 / 6 / 6 / 6</td>
<td>8 / 11 / 11 / 11</td>
</tr>
<tr>
<td>BB 33 / BB 33 / BB 33 / BB 33</td>
<td>8 / 1 / 1 / 1</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 1 / 1 / 1</td>
<td>11 / 3 / 3 / 3</td>
</tr>
<tr>
<td>BB 34 / BB 34 / BB 34 / BB 34</td>
<td>15 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>4 / 1 / 1 / 1</td>
<td>22 / 1 / 1 / 1</td>
</tr>
<tr>
<td>BB 35 / BB 35 / BB 35 / BB 35</td>
<td>2 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 2 / 2 / 2</td>
<td>0 / 1 / 1 / 1</td>
<td>1 / 3 / 3 / 3</td>
<td>4 / 7 / 7 / 7</td>
</tr>
<tr>
<td>BB 36 / BB 36 / BB 36 / BB 36</td>
<td>2 / 5 / 5 / 5</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 6 / 6 / 6</td>
<td>4 / 11 / 11 / 11</td>
</tr>
<tr>
<td>BB 37 / BB 37 / BB 37 / BB 37</td>
<td>0 / 8 / 8 / 8</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 7 / 7 / 7</td>
<td>3 / 17 / 17 / 17</td>
</tr>
<tr>
<td>BB 38 / BB 38 / BB 38 / BB 38</td>
<td>7 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>4 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>5 / 1 / 1 / 1</td>
<td>16 / 2 / 2 / 2</td>
</tr>
<tr>
<td>BB 39 / BB 39 / BB 39 / BB 39</td>
<td>3 / 3 / 3 / 3</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 5 / 5 / 5</td>
<td>6 / 11 / 11 / 11</td>
</tr>
<tr>
<td>BB 40 / BB 40 / BB 40 / BB 40</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 2 / 2 / 2</td>
<td>1 / 3 / 3 / 3</td>
</tr>
<tr>
<td>BB 41 / BB 41 / BB 41 / BB 41</td>
<td>1 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>5 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>4 / 1 / 1 / 1</td>
<td>10 / 3 / 3 / 3</td>
</tr>
<tr>
<td>BB 42 / BB 42 / BB 42 / BB 42</td>
<td>7 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>4 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>5 / 2 / 2 / 2</td>
<td>16 / 3 / 3 / 3</td>
</tr>
<tr>
<td>BB 43 / BB 43 / BB 43 / BB 43</td>
<td>6 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>6 / 4 / 4 / 4</td>
<td>14 / 5 / 5 / 5</td>
</tr>
<tr>
<td>BB 44 / BB 44 / BB 44 / BB 44</td>
<td>0 / 5 / 5 / 5</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 8 / 8 / 8</td>
<td>3 / 15 / 15 / 15</td>
</tr>
<tr>
<td>BB 45 / BB 45 / BB 45 / BB 45</td>
<td>7 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>5 / 2 / 2 / 2</td>
<td>15 / 4 / 4 / 4</td>
</tr>
<tr>
<td>BB 46 / BB 46 / BB 46 / BB 46</td>
<td>1 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 2 / 2 / 2</td>
<td>3 / 2 / 2 / 2</td>
</tr>
<tr>
<td><strong>Total / Total / Total / Total</strong></td>
<td><strong>160 / 123 / 123 / 123</strong></td>
<td><strong>0 / 3 / 3 / 3</strong></td>
<td><strong>92 / 53 / 53 / 53</strong></td>
<td><strong>0 / 2 / 2 / 2</strong></td>
<td><strong>149 / 138 / 138 / 138</strong></td>
<td><strong>401 / 319 / 319 / 319</strong></td>
</tr>
</tbody>
</table>

Table B.1: Categories of instructions in dijkstra_large by basic block
<table>
<thead>
<tr>
<th>Basic Block</th>
<th>LOAD_SINGLE</th>
<th>LOAD_MULTIPLE</th>
<th>STORE_SINGLE</th>
<th>STORE_MULTIPLE</th>
<th>OTHER</th>
<th>TOTAL</th>
</tr>
</thead>
<tbody>
<tr>
<td>BB 1 / BB 1 / BB 1 / BB 1</td>
<td>11 / 8 / 8 / 8</td>
<td>0 / 0 / 0 / 0</td>
<td>13 / 4 / 4 / 4</td>
<td>0 / 2 / 2 / 2</td>
<td>8 / 8 / 6 / 6</td>
<td>32 / 20 / 20 / 20</td>
</tr>
<tr>
<td>BB 2 / BB 2 / BB 2 / BB 2</td>
<td>4 / 3 / 3 / 3</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 3 / 3 / 3</td>
<td>3 / 3 / 3 / 3</td>
<td>7 / 7 / 7 / 7</td>
</tr>
<tr>
<td>BB 3 / BB 3 / BB 3 / BB 3</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 1 / 1 / 1</td>
<td>2 / 2 / 2 / 2</td>
<td>2 / 2 / 2 / 2</td>
<td>3 / 3 / 3 / 3</td>
</tr>
<tr>
<td>BB 4 / BB 4 / BB 4 / BB 4</td>
<td>0 / 2 / 2 / 2</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 1 / 1 / 1</td>
<td>1 / 1 / 1 / 1</td>
<td>3 / 3 / 3 / 3</td>
</tr>
<tr>
<td>BB 5 / BB 5 / BB 5 / BB 5</td>
<td>1 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 3 / 3 / 3</td>
<td>1 / 2 / 2 / 2</td>
<td>1 / 2 / 2 / 2</td>
<td>5 / 5 / 5 / 5</td>
</tr>
<tr>
<td>BB 6 / BB 6 / BB 6 / BB 6</td>
<td>0 / 7 / 7 / 7</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 5 / 5 / 5</td>
<td>1 / 5 / 5 / 5</td>
<td>2 / 2 / 2 / 2</td>
</tr>
<tr>
<td>BB 7 / BB 7 / BB 7 / BB 7</td>
<td>1 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 6 / 6 / 6</td>
<td>3 / 6 / 6 / 6</td>
<td>3 / 6 / 6 / 6</td>
<td>7 / 7 / 7 / 7</td>
</tr>
<tr>
<td>BB 8 / BB 8 / BB 8 / BB 8</td>
<td>3 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 1 / 1 / 1</td>
</tr>
<tr>
<td>BB 9 / BB 9 / BB 9 / BB 9</td>
<td>1 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 1 / 1 / 1</td>
</tr>
<tr>
<td>BB 10 / BB 10 / BB 10 / BB 10</td>
<td>3 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 1 / 1 / 1</td>
</tr>
<tr>
<td>BB 11 / BB 11 / BB 11 / BB 11</td>
<td>6 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 3 / 3 / 3</td>
</tr>
<tr>
<td>BB 12 / BB 12 / BB 12 / BB 12</td>
<td>28 / 5 / 5 / 5</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 0 / 0 / 0</td>
<td>2 / 0 / 0 / 0</td>
<td>2 / 0 / 0 / 0</td>
<td>6 / 6 / 6 / 6</td>
</tr>
<tr>
<td>BB 13 / BB 13 / BB 13 / BB 13</td>
<td>2 / 3 / 3 / 3</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>6 / 6 / 6 / 6</td>
</tr>
<tr>
<td>BB 14 / BB 14 / BB 14 / BB 14</td>
<td>2 / 3 / 3 / 2</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 3 / 3 / 3</td>
<td>1 / 1 / 1 / 1</td>
<td>1 / 1 / 1 / 1</td>
<td>6 / 6 / 6 / 6</td>
</tr>
<tr>
<td>BB 15 / BB 15 / BB 15 / BB 15</td>
<td>2 / 3 / 3 / 2</td>
<td>0 / 0 / 0 / 1</td>
<td>0 / 0 / 0 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>6 / 6 / 6 / 6</td>
</tr>
<tr>
<td>BB 16 / BB 16 / BB 16 / BB 16</td>
<td>1 / 1 / 1 / 1</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 1 / 1 / 1</td>
</tr>
<tr>
<td>BB 17 / BB 17 / BB 17 / BB 17</td>
<td>4 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 1 / 1 / 1</td>
</tr>
<tr>
<td>BB 18 / BB 18 / BB 18 / BB 18</td>
<td>6 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 1 / 1 / 1</td>
<td>1 / 1 / 1 / 1</td>
<td>1 / 1 / 1 / 1</td>
</tr>
<tr>
<td>BB 19 / BB 19 / BB 19 / BB 19</td>
<td>2 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 1 / 1 / 1</td>
</tr>
<tr>
<td>BB 20 / BB 20 / BB 20 / BB 20</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 1 / 1 / 1</td>
</tr>
<tr>
<td>BB 21 / BB 21 / BB 21 / BB 21</td>
<td>4 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 1 / 1 / 1</td>
</tr>
<tr>
<td>Total / Total / Total / Total</td>
<td>81 / 50 / 50 / 51</td>
<td>0 / 4 / 4 / 4</td>
<td>56 / 21 / 21 / 21</td>
<td>0 / 3 / 3 / 3</td>
<td>96 / 107 / 107 / 109</td>
<td>233 / 185 / 185 / 188</td>
</tr>
</tbody>
</table>

Table B.2: Categories of instructions in qsort_large by basic block
<table>
<thead>
<tr>
<th>O0 none / O3 nopreorpost / O3 pre_ra / O3 MagPs</th>
<th>LOAD_SINGLE</th>
<th>LOAD_MULTIPLE</th>
<th>STORE_SINGLE</th>
<th>STORE_MULTIPLE</th>
<th>OTHER</th>
<th>TOTAL</th>
</tr>
</thead>
<tbody>
<tr>
<td>BB 1 / BB 1 / BB 1 / BB 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>4 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>4 / 3 / 3 / 3</td>
<td>8 / 5 / 5 / 5</td>
</tr>
<tr>
<td>BB 2 / BB 2 / BB 2 / BB 2</td>
<td>4 / 4 / 4 / 4</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 3 / 3 / 3</td>
<td>9 / 8 / 8 / 8</td>
</tr>
<tr>
<td>BB 3 / BB 3 / BB 3 / BB 3</td>
<td>1 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 2 / 2 / 2</td>
<td>3 / 4 / 4 / 4</td>
</tr>
<tr>
<td>BB 4 / BB 4 / BB 4 / BB 4</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 2 / 2 / 2</td>
<td>3 / 2 / 2 / 2</td>
</tr>
<tr>
<td>BB 5 / BB 5 / BB 5 / BB 5</td>
<td>6 / 6 / 6 / 6</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>6 / 6 / 6 / 6</td>
<td>14 / 13 / 13 / 13</td>
</tr>
<tr>
<td>BB 6 / BB 6 / BB 6 / BB 6</td>
<td>1 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 2 / 2 / 2</td>
<td>3 / 4 / 4 / 4</td>
</tr>
<tr>
<td>BB 7 / BB 7 / BB 7 / BB 7</td>
<td>10 / 10 / 10 / 10</td>
<td>0 / 0 / 0 / 0</td>
<td>6 / 5 / 5 / 5</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 2 / 2 / 2</td>
<td>18 / 17 / 17 / 17</td>
</tr>
<tr>
<td>BB 8 / BB 8 / BB 8 / BB 8</td>
<td>14 / 14 / 14 / 14</td>
<td>0 / 0 / 0 / 0</td>
<td>7 / 6 / 6 / 6</td>
<td>0 / 0 / 0 / 0</td>
<td>10 / 10 / 10 / 10</td>
<td>31 / 30 / 30 / 30</td>
</tr>
<tr>
<td>BB 9 / BB 9 / BB 9 / BB 9</td>
<td>1 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 2 / 2 / 2</td>
<td>3 / 4 / 4 / 4</td>
</tr>
<tr>
<td>BB 10 / BB 10 / BB 10 / BB 10</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 2 / 2 / 2</td>
<td>3 / 2 / 2 / 2</td>
</tr>
<tr>
<td>BB 11 / BB 11 / BB 11 / BB 11</td>
<td>14 / 14 / 14 / 14</td>
<td>0 / 0 / 0 / 0</td>
<td>7 / 6 / 6 / 6</td>
<td>0 / 0 / 0 / 0</td>
<td>9 / 9 / 9 / 9</td>
<td>30 / 29 / 29 / 29</td>
</tr>
<tr>
<td>BB 12 / BB 12 / BB 12 / BB 12</td>
<td>1 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 2 / 2 / 2</td>
<td>3 / 4 / 4 / 4</td>
</tr>
<tr>
<td>BB 13 / BB 13 / BB 13 / BB 13</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 2 / 2 / 2</td>
<td>3 / 2 / 2 / 2</td>
</tr>
<tr>
<td>BB 14 / BB 14 / BB 14 / BB 14</td>
<td>14 / 14 / 14 / 14</td>
<td>0 / 0 / 0 / 0</td>
<td>7 / 6 / 6 / 6</td>
<td>0 / 0 / 0 / 0</td>
<td>11 / 11 / 11 / 11</td>
<td>32 / 31 / 31 / 31</td>
</tr>
<tr>
<td>BB 15 / BB 15 / BB 15 / BB 15</td>
<td>1 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 2 / 2 / 2</td>
<td>3 / 4 / 4 / 4</td>
</tr>
<tr>
<td>BB 16 / BB 16 / BB 16 / BB 16</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 2 / 2 / 2</td>
<td>3 / 2 / 2 / 2</td>
</tr>
<tr>
<td>BB 17 / BB 17 / BB 17 / BB 17</td>
<td>14 / 14 / 14 / 14</td>
<td>0 / 0 / 0 / 0</td>
<td>7 / 6 / 6 / 6</td>
<td>0 / 0 / 0 / 0</td>
<td>9 / 9 / 9 / 9</td>
<td>30 / 29 / 29 / 29</td>
</tr>
<tr>
<td>BB 18 / BB 18 / BB 18 / BB 18</td>
<td>1 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 2 / 2 / 2</td>
<td>3 / 4 / 4 / 4</td>
</tr>
<tr>
<td>BB 19 / BB 19 / BB 19 / BB 19</td>
<td>15 / 16 / 16 / 16</td>
<td>0 / 0 / 0 / 0</td>
<td>5 / 5 / 5 / 5</td>
<td>0 / 0 / 0 / 0</td>
<td>5 / 7 / 7 / 7</td>
<td>25 / 28 / 28 / 28</td>
</tr>
<tr>
<td>BB 20 / BB 20 / BB 20 / BB 20</td>
<td>1 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 1 / 1 / 1</td>
<td>4 / 1 / 1 / 1</td>
</tr>
<tr>
<td>BB 21 / BB 21 / BB 21 / BB 21</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 4 / 4 / 4</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 4 / 4 / 4</td>
<td>1 / 9 / 9 / 9</td>
</tr>
<tr>
<td>BB 22 / BB 22 / BB 22 / BB 22</td>
<td>1 / 17 / 17 / 17</td>
<td>0 / 0 / 0 / 0</td>
<td>7 / 9 / 9 / 9</td>
<td>0 / 0 / 0 / 0</td>
<td>4 / 2 / 2 / 2</td>
<td>12 / 28 / 28 / 28</td>
</tr>
<tr>
<td>BB 23 / BB 23 / BB 23 / BB 23</td>
<td>17 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>10 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 5 / 5 / 5</td>
<td>29 / 8 / 8 / 8</td>
</tr>
<tr>
<td>BB 24 / BB 24 / BB 24 / BB 24</td>
<td>2 / 11 / 11 / 11</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 8 / 8 / 8</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 3 / 3 / 4</td>
<td>4 / 22 / 22 / 23</td>
</tr>
<tr>
<td>BB 25 / BB 25 / BB 25 / BB 25</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 1 / 1 / 1</td>
<td>3 / 1 / 1 / 1</td>
</tr>
<tr>
<td>BB 26 / BB 26 / BB 26 / BB 26</td>
<td>11 / 10 / 10 / 10</td>
<td>0 / 0 / 0 / 0</td>
<td>9 / 7 / 7 / 7</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 7 / 7 / 7</td>
<td>22 / 24 / 24 / 24</td>
</tr>
<tr>
<td>BB 27 / BB 27 / BB 27 / BB 27</td>
<td>0 / 6 / 6 / 6</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 9 / 9 / 9</td>
<td>2 / 17 / 17 / 17</td>
</tr>
<tr>
<td>BB 28 / BB 28 / BB 28 / BB 28</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 2 / 2 / 2</td>
<td>1 / 3 / 3 / 3</td>
</tr>
<tr>
<td>BB 29 / BB 29 / BB 29 / BB 29</td>
<td>2 / 4 / 4 / 4</td>
<td>0 / 0 / 0 / 0</td>
<td>7 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>5 / 5 / 5 / 5</td>
<td>14 / 9 / 9 / 9</td>
</tr>
<tr>
<td>BB 30</td>
<td>2 / 6 / 6 / 6</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 7 / 7 / 7</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 8 / 8 / 8</td>
<td>4 / 21 / 21 / 21</td>
</tr>
<tr>
<td>BB 31</td>
<td>6 / 5 / 5 / 5</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 11 / 11 / 11</td>
<td>11 / 16 / 16 / 16</td>
</tr>
<tr>
<td>BB 32</td>
<td>7 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>8 / 2 / 2 / 2</td>
<td>18 / 4 / 4 / 4</td>
</tr>
<tr>
<td>BB 33</td>
<td>1 / 7 / 7 / 7</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 10 / 10 / 10</td>
<td>3 / 19 / 19 / 19</td>
</tr>
<tr>
<td>BB 34</td>
<td>3 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 2 / 2 / 2</td>
<td>0 / 1 / 1 / 1</td>
<td>2 / 5 / 5 / 5</td>
<td>5 / 9 / 9 / 9</td>
</tr>
<tr>
<td>BB 35</td>
<td>1 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>4 / 3 / 3 / 3</td>
<td>5 / 5 / 5 / 5</td>
</tr>
<tr>
<td>BB 36</td>
<td>6 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>8 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>9 / 7 / 7 / 7</td>
<td>23 / 9 / 9 / 9</td>
</tr>
<tr>
<td>BB 37</td>
<td>7 / 1 / 1 / 1</td>
<td>0 / 1 / 1 / 1</td>
<td>2 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>14 / 5 / 5 / 5</td>
<td>23 / 7 / 7 / 7</td>
</tr>
<tr>
<td>BB 38</td>
<td>2 / 3 / 3 / 3</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 4 / 4 / 4</td>
<td>0 / 0 / 0 / 0</td>
<td>5 / 4 / 4 / 4</td>
<td>7 / 12 / 12 / 12</td>
</tr>
</tbody>
</table>

Table B.3: Categories of instructions in sha by basic block
<table>
<thead>
<tr>
<th>O0 none / O3 nopreorpost / O3 pre_ra / O3 MagPs</th>
<th>LOAD_SINGLE</th>
<th>LOAD_MULTIPLE</th>
<th>STORE_SINGLE</th>
<th>STORE_MULTIPLE</th>
<th>OTHER</th>
<th>TOTAL</th>
</tr>
</thead>
<tbody>
<tr>
<td>BB 1 / BB 1 / BB 1 / BB 1</td>
<td>1 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>6 / 2 / 2 / 2</td>
<td>0 / 1 / 1 / 1</td>
<td>6 / 5 / 5 / 5</td>
<td>13 / 9 / 9 / 9</td>
</tr>
<tr>
<td>BB 2 / BB 2 / BB 2 / BB 2</td>
<td>4 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>5 / 1 / 1 / 1</td>
<td>12 / 3 / 3 / 3</td>
</tr>
<tr>
<td>BB 3 / BB 3 / BB 3 / BB 3</td>
<td>0 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 6 / 6 / 6</td>
<td>1 / 9 / 9 / 9</td>
</tr>
<tr>
<td>BB 4 / BB 4 / BB 4 / BB 4</td>
<td>3 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>4 / 5 / 5 / 5</td>
<td>9 / 9 / 9 / 9</td>
</tr>
<tr>
<td>BB 5 / BB 5 / BB 5 / BB 5</td>
<td>3 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 3 / 3 / 3</td>
<td>6 / 5 / 5 / 5</td>
</tr>
<tr>
<td>BB 6 / BB 6 / BB 6 / BB 6</td>
<td>3 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>4 / 6 / 6 / 6</td>
<td>9 / 8 / 8 / 8</td>
</tr>
<tr>
<td>BB 7 / BB 7 / BB 7 / BB 7</td>
<td>1 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 2 / 2 / 2</td>
<td>5 / 4 / 4 / 4</td>
</tr>
<tr>
<td>BB 8 / BB 8 / BB 8 / BB 8</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 1 / 1 / 1</td>
<td>2 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 2 / 2 / 2</td>
<td>3 / 6 / 6 / 6</td>
</tr>
<tr>
<td>Total / Total / Total / Total</td>
<td>18 / 13 / 13 / 13</td>
<td>0 / 1 / 1 / 1</td>
<td>17 / 8 / 8 / 8</td>
<td>0 / 1 / 1 / 1</td>
<td>29 / 30 / 30 / 30</td>
<td>64 / 53 / 53 / 53</td>
</tr>
</tbody>
</table>

Table B.4: Categories of instructions in sha_driver by basic block
<table>
<thead>
<tr>
<th>O0 none / O3 nopreorpost / O3 pre_ra / O3 MagPs</th>
<th>LOAD_SINGLE</th>
<th>LOAD_MULTIPLE</th>
<th>STORE_SINGLE</th>
<th>STORE_MULTIPLE</th>
<th>OTHER</th>
<th>TOTAL</th>
</tr>
</thead>
<tbody>
<tr>
<td>BB 1 / BB 1</td>
<td>5 / 4 / 4</td>
<td>0 / 0 / 0</td>
<td>8 / 4 / 4</td>
<td>0 / 1 / 1</td>
<td>5 / 6 / 6</td>
<td>18 / 15 / 15</td>
</tr>
<tr>
<td>BB 2 / BB 2</td>
<td>3 / 3 / 3</td>
<td>0 / 0 / 0</td>
<td>2 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>2 / 1 / 1</td>
<td>9 / 5 / 5</td>
</tr>
<tr>
<td>BB 3 / BB 3</td>
<td>6 / 6 / 6</td>
<td>0 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>2 / 2 / 2</td>
<td>10 / 8 / 8</td>
</tr>
<tr>
<td>BB 4 / BB 4</td>
<td>1 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>1 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>1 / 0 / 0</td>
<td>3 / 1 / 1</td>
</tr>
<tr>
<td>BB 5 / BB 5</td>
<td>1 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>0 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>2 / 3 / 3</td>
<td>3 / 5 / 5</td>
</tr>
<tr>
<td>BB 6 / BB 6</td>
<td>1 / 8 / 8</td>
<td>0 / 0 / 0</td>
<td>0 / 3 / 3</td>
<td>0 / 0 / 0</td>
<td>2 / 6 / 6</td>
<td>3 / 17 / 17</td>
</tr>
<tr>
<td>BB 7 / BB 7</td>
<td>5 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>1 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>2 / 2 / 2</td>
<td>8 / 3 / 3</td>
</tr>
<tr>
<td>BB 8 / BB 8</td>
<td>0 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>1 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>2 / 2 / 2</td>
<td>6 / 4 / 4</td>
</tr>
<tr>
<td>BB 9 / BB 9</td>
<td>2 / 10 / 10</td>
<td>0 / 0 / 0</td>
<td>1 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>2 / 5 / 5</td>
<td>5 / 16 / 15</td>
</tr>
<tr>
<td>BB 10 / BB 10</td>
<td>1 / 2 / 2</td>
<td>0 / 1 / 1</td>
<td>1 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>1 / 4 / 4</td>
<td>3 / 8 / 8</td>
</tr>
<tr>
<td>BB 11 / BB 11</td>
<td>1 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>2 / 1 / 1</td>
<td>3 / 1 / 1</td>
</tr>
<tr>
<td>BB 12 / BB 12</td>
<td>3 / 2 / 2</td>
<td>0 / 0 / 0</td>
<td>2 / 3 / 3</td>
<td>0 / 1 / 1</td>
<td>2 / 5 / 5</td>
<td>7 / 11 / 11</td>
</tr>
<tr>
<td>BB 13 / BB 13</td>
<td>10 / 4 / 4</td>
<td>0 / 0 / 0</td>
<td>0 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>3 / 3 / 3</td>
<td>13 / 8 / 8</td>
</tr>
<tr>
<td>BB 14 / BB 14</td>
<td>4 / 4 / 4</td>
<td>0 / 0 / 0</td>
<td>1 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>2 / 3 / 3</td>
<td>7 / 8 / 8</td>
</tr>
<tr>
<td>BB 15 / BB 15</td>
<td>1 / 2 / 2</td>
<td>0 / 0 / 0</td>
<td>1 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>1 / 3 / 3</td>
<td>3 / 5 / 5</td>
</tr>
<tr>
<td>BB 16 / BB 16</td>
<td>3 / 4 / 4</td>
<td>0 / 0 / 0</td>
<td>0 / 3 / 3</td>
<td>0 / 0 / 0</td>
<td>3 / 5 / 5</td>
<td>6 / 12 / 12</td>
</tr>
<tr>
<td>BB 17 / BB 17</td>
<td>1 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>1 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>3 / 2 / 2</td>
<td>4 / 4 / 4</td>
</tr>
<tr>
<td>BB 18 / BB 18</td>
<td>0 / 7 / 7</td>
<td>0 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>1 / 2 / 2</td>
<td>1 / 9 / 9</td>
</tr>
<tr>
<td>BB 19 / BB 19</td>
<td>2 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>5 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>5 / 2 / 2</td>
<td>12 / 3 / 3</td>
</tr>
<tr>
<td>BB 20 / BB 20</td>
<td>1 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>1 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>2 / 1 / 1</td>
<td>3 / 2 / 2</td>
</tr>
<tr>
<td>BB 21 / BB 21</td>
<td>2 / 2 / 2</td>
<td>0 / 0 / 0</td>
<td>1 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>1 / 3 / 3</td>
<td>4 / 6 / 6</td>
</tr>
<tr>
<td>BB 22 / BB 22</td>
<td>5 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>1 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>3 / 1 / 1</td>
<td>9 / 1 / 1</td>
</tr>
<tr>
<td>BB 23 / BB 23</td>
<td>2 / 2 / 2</td>
<td>0 / 1 / 1</td>
<td>0 / 2 / 2</td>
<td>0 / 0 / 0</td>
<td>5 / 1 / 1</td>
<td>7 / 6 / 6</td>
</tr>
<tr>
<td>BB 24 / BB 24</td>
<td>0 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>1 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>2 / 1 / 1</td>
<td>3 / 1 / 1</td>
</tr>
<tr>
<td>BB 25 / BB 25</td>
<td>5 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>3 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>5 / 1 / 1</td>
<td>13 / 1 / 1</td>
</tr>
<tr>
<td>BB 26 / BB 26</td>
<td>1 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>1 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>3 / 1 / 1</td>
<td>5 / 1 / 1</td>
</tr>
<tr>
<td>BB 27 / BB 27</td>
<td>9 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>0 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>2 / 1 / 1</td>
<td>11 / 1 / 1</td>
</tr>
<tr>
<td>BB 28 / BB 28</td>
<td>1 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>0 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>2 / 1 / 1</td>
<td>3 / 1 / 1</td>
</tr>
<tr>
<td>BB 29 / BB 29</td>
<td>1 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>1 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>1 / 1 / 1</td>
<td>3 / 1 / 1</td>
</tr>
<tr>
<td>BB 32 / - / - / -</td>
<td>0 / - / -</td>
<td>0 / - / -</td>
<td>0 / - / -</td>
<td>0 / - / -</td>
<td>1 / - / -</td>
<td>1 / - / -</td>
</tr>
<tr>
<td>Total / Total / Total / Total</td>
<td>88 / 67 / 65 / 65</td>
<td>0 / 2 / 3 / 3</td>
<td>35 / 24 / 24 / 24</td>
<td>0 / 2 / 2 / 2</td>
<td>78 / 63 / 63 / 63</td>
<td>201 / 158 / 157 / 157</td>
</tr>
</tbody>
</table>

Table B.5: Categories of instructions in bmhasrch by basic block
<table>
<thead>
<tr>
<th>O0 none/O3 noneprepost/O3 pre ra/O3 MagPs</th>
<th>LOAD_SINGLE</th>
<th>LOAD_MULTIPLE</th>
<th>STORE_SINGLE</th>
<th>STORE_MULTIPLE</th>
<th>OTHER</th>
<th>TOTAL</th>
</tr>
</thead>
<tbody>
<tr>
<td>BB 1 / BB 1 / BB 1 / BB 1</td>
<td>6 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>8 / 3 / 3 / 3</td>
<td>0 / 1 / 1 / 1</td>
<td>6 / 6 / 6 / 6</td>
<td>20 / 12 / 12 / 12</td>
</tr>
<tr>
<td>BB 2 / BB 2 / BB 2 / BB 2</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 3 / 3 / 3</td>
<td>2 / 4 / 4 / 4</td>
<td>2 / 4 / 4 / 4</td>
</tr>
<tr>
<td>BB 3 / BB 3 / BB 3 / BB 3</td>
<td>1 / 4 / 4 / 4</td>
<td>0 / 1 / 1 / 1</td>
<td>2 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 2 / 2 / 2</td>
<td>6 / 8 / 8 / 8</td>
</tr>
<tr>
<td>BB 4 / BB 4 / BB 4 / BB 4</td>
<td>8 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 2 / 2 / 2</td>
<td>13 / 5 / 5 / 5</td>
</tr>
<tr>
<td>BB 5 / BB 5 / BB 5 / BB 5</td>
<td>3 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 2 / 2 / 2</td>
<td>5 / 4 / 4 / 4</td>
</tr>
<tr>
<td>BB 6 / BB 6 / BB 6 / BB 6</td>
<td>0 / 3 / 3 / 3</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 1 / 1 / 1</td>
<td>3 / 6 / 6 / 6</td>
</tr>
<tr>
<td>BB 7 / BB 7 / BB 7 / BB 7</td>
<td>5 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 2 / 2 / 2</td>
<td>8 / 3 / 3 / 3</td>
</tr>
<tr>
<td>BB 8 / BB 8 / BB 8 / BB 8</td>
<td>1 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 2 / 2 / 2</td>
<td>3 / 2 / 2 / 2</td>
</tr>
<tr>
<td>BB 9 / BB 9 / BB 9 / BB 9</td>
<td>0 / 10 / 8 / 8</td>
<td>0 / 0 / 1 / 1</td>
<td>1 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 6 / 6 / 6</td>
<td>3 / 18 / 17 / 17</td>
</tr>
<tr>
<td>BB 10 / BB 10 / BB 10 / BB 10</td>
<td>15 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>5 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>6 / 3 / 3 / 3</td>
<td>26 / 6 / 6 / 6</td>
</tr>
<tr>
<td>BB 11 / BB 11 / BB 11 / BB 11</td>
<td>3 / 3 / 3 / 3</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 4 / 4 / 4</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 6 / 6 / 6</td>
<td>6 / 14 / 14 / 14</td>
</tr>
<tr>
<td>BB 12 / BB 12 / BB 12 / BB 12</td>
<td>12 / 7 / 7 / 7</td>
<td>0 / 0 / 0 / 0</td>
<td>8 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>6 / 4 / 4 / 4</td>
<td>26 / 12 / 12 / 12</td>
</tr>
<tr>
<td>BB 13 / BB 13 / BB 13 / BB 13</td>
<td>5 / 2 / 2 / 2</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 4 / 4 / 4</td>
<td>7 / 8 / 8 / 8</td>
</tr>
<tr>
<td>BB 14 / BB 14 / BB 14 / BB 14</td>
<td>4 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 2 / 2 / 2</td>
<td>7 / 2 / 2 / 2</td>
</tr>
<tr>
<td>BB 15 / BB 15 / BB 15 / BB 15</td>
<td>1 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 1 / 1 / 1</td>
<td>3 / 1 / 1 / 1</td>
</tr>
<tr>
<td>BB 16 / BB 16 / BB 16 / BB 16</td>
<td>3 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 3 / 3 / 3</td>
<td>0 / 1 / 1 / 1</td>
<td>3 / 5 / 5 / 5</td>
<td>6 / 11 / 11 / 11</td>
</tr>
<tr>
<td>BB 17 / BB 17 / BB 17 / BB 17</td>
<td>1 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 2 / 2 / 2</td>
<td>4 / 3 / 3 / 3</td>
</tr>
<tr>
<td>BB 18 / BB 18 / BB 18 / BB 18</td>
<td>0 / 3 / 3 / 3</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 3 / 3 / 3</td>
<td>1 / 7 / 7 / 7</td>
</tr>
<tr>
<td>BB 19 / BB 19 / BB 19 / BB 19</td>
<td>2 / 4 / 4 / 4</td>
<td>0 / 0 / 0 / 0</td>
<td>6 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>6 / 3 / 3 / 3</td>
<td>14 / 8 / 8 / 8</td>
</tr>
<tr>
<td>BB 20 / BB 20 / BB 20 / BB 20</td>
<td>0 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 3 / 3 / 3</td>
<td>3 / 5 / 5 / 5</td>
</tr>
<tr>
<td>BB 21 / BB 21 / BB 21 / BB 21</td>
<td>2 / 4 / 4 / 4</td>
<td>0 / 1 / 1 / 1</td>
<td>1 / 3 / 3 / 3</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 5 / 5 / 5</td>
<td>4 / 12 / 12 / 12</td>
</tr>
<tr>
<td>BB 22 / BB 22 / BB 22 / BB 22</td>
<td>5 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 2 / 2 / 2</td>
<td>9 / 4 / 4 / 4</td>
</tr>
<tr>
<td>BB 23 / BB 23 / BB 23 / BB 23</td>
<td>2 / 6 / 6 / 6</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>5 / 3 / 3 / 3</td>
<td>7 / 9 / 9 / 9</td>
</tr>
<tr>
<td>BB 24 / BB 24 / BB 24 / BB 24</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 2 / 2 / 2</td>
<td>3 / 3 / 3 / 3</td>
</tr>
<tr>
<td>BB 25 / BB 25 / BB 25 / BB 25</td>
<td>5 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>5 / 1 / 1 / 1</td>
<td>13 / 3 / 3 / 3</td>
</tr>
<tr>
<td>BB 26 / BB 26 / BB 26 / BB 26</td>
<td>1 / 2 / 2 / 2</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 3 / 3 / 3</td>
<td>5 / 6 / 6 / 6</td>
</tr>
<tr>
<td>BB 27 / BB 27 / BB 27 / BB 27</td>
<td>7 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 1 / 1 / 1</td>
<td>10 / 2 / 2 / 2</td>
</tr>
<tr>
<td>BB 28 / BB 28 / BB 28 / BB 28</td>
<td>1 / 2 / 2 / 2</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>2 / 1 / 1 / 1</td>
<td>3 / 5 / 5 / 5</td>
</tr>
<tr>
<td>BB 29 / BB 29 / BB 29 / BB 29</td>
<td>1 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 0 / 0 / 0</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 1 / 1 / 1</td>
<td>3 / 1 / 1 / 1</td>
</tr>
<tr>
<td>BB 30 / BB 30 / BB 30 / BB 30</td>
<td>3 / 3 / 3 / 3</td>
<td>0 / 0 / 0 / 0</td>
<td>1 / 1 / 1 / 1</td>
<td>0 / 0 / 0 / 0</td>
<td>3 / 2 / 2 / 2</td>
<td>7 / 6 / 6 / 6</td>
</tr>
<tr>
<td>Total / Total / Total / Total</td>
<td>103 / 69 / 67 / 67</td>
<td>0 / 4 / 5 / 5</td>
<td>51 / 32 / 32 / 32</td>
<td>0 / 2 / 2 / 2</td>
<td>96 / 83 / 83 / 83</td>
<td>250 / 190 / 189 / 189</td>
</tr>
</tbody>
</table>

Table B.6: Categories of instructions in bmhisrch by basic block
<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>STORE</td>
<td>18/12/12/12</td>
<td>5/4/4/4</td>
<td>0/11/1/1</td>
<td>8/4/4/4</td>
<td>0/10/0/0</td>
<td>0/20/0/0</td>
<td>0/3/3/3</td>
<td>0/3/3/3</td>
<td>0/3/3/3</td>
<td>0/3/3/3</td>
<td>0/3/3/3</td>
<td>0/3/3/3</td>
<td>0/3/3/3</td>
</tr>
<tr>
<td>LOAD SINGLE</td>
<td>2/2/2/2</td>
<td>5/7/7/7</td>
<td>2/1/1/1</td>
<td>4/1/1/1</td>
<td>2/1/1/1</td>
<td>2/1/1/1</td>
<td>3/1/1/1</td>
<td>3/1/1/1</td>
<td>3/1/1/1</td>
<td>3/1/1/1</td>
<td>3/1/1/1</td>
<td>3/1/1/1</td>
<td>3/1/1/1</td>
</tr>
<tr>
<td>LOAD MULTIPLE</td>
<td>3/3/3/3</td>
<td>6/11/11/11</td>
<td>0/0/0/0</td>
<td>0/0/0/0</td>
<td>0/0/0/0</td>
<td>0/0/0/0</td>
<td>0/0/0/0</td>
<td>0/0/0/0</td>
<td>0/0/0/0</td>
<td>0/0/0/0</td>
<td>0/0/0/0</td>
<td>0/0/0/0</td>
<td>0/0/0/0</td>
</tr>
<tr>
<td>OTHER</td>
<td>2/1/1/1</td>
<td>2/1/1/1</td>
<td>2/1/1/1</td>
<td>3/1/1/1</td>
<td>3/0/0/0</td>
<td>0/0/0/0</td>
<td>0/1/1/1</td>
<td>0/1/1/1</td>
<td>0/1/1/1</td>
<td>0/1/1/1</td>
<td>0/1/1/1</td>
<td>0/1/1/1</td>
<td>0/1/1/1</td>
</tr>
</tbody>
</table>

| Total / Total / Total / Total | 74 / 53 / 51 / 51 | 0 / 2 / 3 / 3 | 36 / 26 / 26 / 26 | 0 / 1 / 1 / 1 | 71 / 60 / 60 / 60 | 181 / 142 / 141 / 141 |

Table B.7: Categories of instructions in bmhsrch by basic block
<table>
<thead>
<tr>
<th></th>
<th>LOAD_SINGLE</th>
<th>LOAD_MULTIPLE</th>
<th>STORE_SINGLE</th>
<th>STORE_MULTIPLE</th>
<th>OTHER</th>
<th>TOTAL</th>
</tr>
</thead>
<tbody>
<tr>
<td>BB 1 / BB 1 / BB 1</td>
<td>3 / 2 / 2</td>
<td>0 / 0 / 0</td>
<td>7 / 4 / 4</td>
<td>0 / 0 / 0</td>
<td>5 / 4 / 4</td>
<td>15 / 10 / 10</td>
</tr>
<tr>
<td>BB 2 / BB 2 / BB 2</td>
<td>5 / 3 / 3</td>
<td>0 / 0 / 0</td>
<td>2 / 2 / 2</td>
<td>0 / 0 / 0</td>
<td>1 / 1 / 1</td>
<td>8 / 6 / 6</td>
</tr>
<tr>
<td>BB 3 / BB 3 / BB 3</td>
<td>1 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>2 / 2 / 2</td>
<td>3 / 3 / 3</td>
</tr>
<tr>
<td>BB 4 / BB 4 / BB 4</td>
<td>0 / 0 / 0</td>
<td>1 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>2 / 2 / 2</td>
<td>3 / 2 / 2</td>
</tr>
<tr>
<td>BB 5 / BB 5 / BB 5</td>
<td>7 / 3 / 3</td>
<td>0 / 1 / 1</td>
<td>2 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>3 / 3 / 3</td>
<td>12 / 8 / 8</td>
</tr>
<tr>
<td>BB 6 / BB 6 / BB 6</td>
<td>3 / 4 / 4</td>
<td>0 / 0 / 0</td>
<td>0 / 2 / 2</td>
<td>0 / 0 / 0</td>
<td>2 / 4 / 4</td>
<td>5 / 10 / 10</td>
</tr>
<tr>
<td>BB 7 / BB 7 / BB 7</td>
<td>2 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>1 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>3 / 1 / 1</td>
</tr>
<tr>
<td>BB 8 / BB 8 / BB 8</td>
<td>1 / 4 / 4</td>
<td>0 / 0 / 0</td>
<td>0 / 3 / 3</td>
<td>0 / 1 / 1</td>
<td>3 / 4 / 4</td>
<td>4 / 12 / 12</td>
</tr>
<tr>
<td>BB 9 / BB 9 / BB 9</td>
<td>0 / 2 / 2</td>
<td>0 / 0 / 0</td>
<td>1 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>0 / 1 / 1</td>
<td>1 / 4 / 4</td>
</tr>
<tr>
<td>BB 10 / BB 10 / BB 10</td>
<td>4 / 2 / 2</td>
<td>0 / 0 / 0</td>
<td>6 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>5 / 2 / 2</td>
<td>15 / 4 / 4</td>
</tr>
<tr>
<td>BB 11 / BB 11 / BB 11</td>
<td>0 / 4 / 4</td>
<td>0 / 0 / 0</td>
<td>0 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>1 / 2 / 2</td>
<td>1 / 7 / 7</td>
</tr>
<tr>
<td>BB 12 / BB 12 / BB 12</td>
<td>2 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>1 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>1 / 2 / 2</td>
<td>4 / 3 / 3</td>
</tr>
<tr>
<td>BB 13 / BB 13 / BB 13</td>
<td>2 / 4 / 4</td>
<td>0 / 0 / 0</td>
<td>0 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>2 / 6 / 6</td>
<td>4 / 11 / 11</td>
</tr>
<tr>
<td>BB 14 / BB 14 / BB 14</td>
<td>5 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>1 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>2 / 1 / 1</td>
<td>8 / 3 / 3</td>
</tr>
<tr>
<td>BB 15 / BB 15 / BB 15</td>
<td>1 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>0 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>2 / 1 / 1</td>
<td>3 / 3 / 3</td>
</tr>
<tr>
<td>BB 16 / BB 16 / BB 16</td>
<td>7 / 2 / 2</td>
<td>0 / 0 / 0</td>
<td>2 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>6 / 2 / 2</td>
<td>15 / 4 / 4</td>
</tr>
<tr>
<td>BB 17 / BB 17 / BB 17</td>
<td>1 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>1 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>1 / 1 / 1</td>
<td>3 / 2 / 2</td>
</tr>
<tr>
<td>BB 18 / BB 18 / BB 18</td>
<td>1 / 2 / 2</td>
<td>0 / 1 / 1</td>
<td>1 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>1 / 1 / 1</td>
<td>3 / 5 / 5</td>
</tr>
<tr>
<td>BB 19 / BB 19 / BB 19</td>
<td>2 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>2 / 1 / 1</td>
<td>4 / 1 / 1</td>
</tr>
<tr>
<td>BB 20 / BB 20 / BB 20</td>
<td>0 / 6 / 6</td>
<td>0 / 0 / 0</td>
<td>1 / 0 / 0</td>
<td>0 / 1 / 1</td>
<td>1 / 15 / 15</td>
<td>2 / 2 / 2</td>
</tr>
<tr>
<td>BB 21 / BB 21 / BB 21</td>
<td>1 / 4 / 4</td>
<td>0 / 0 / 0</td>
<td>1 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>0 / 5 / 5</td>
<td>2 / 9 / 9</td>
</tr>
<tr>
<td>BB 22 / BB 22 / BB 22</td>
<td>2 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>3 / 3 / 3</td>
<td>5 / 4 / 4</td>
</tr>
<tr>
<td>BB 23 / BB 23 / BB 23</td>
<td>0 / 1 / 1</td>
<td>0 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>0 / 0 / 0</td>
<td>1 / 3 / 3</td>
<td>1 / 4 / 4</td>
</tr>
<tr>
<td>BB 24 / BB 24 / BB 24</td>
<td>5 / 9 / 9</td>
<td>0 / 1 / 1</td>
<td>6 / 6 / 6</td>
<td>0 / 0 / 0</td>
<td>14 / 12 / 12</td>
<td>25 / 28 / 28</td>
</tr>
<tr>
<td>BB 30 / / / /</td>
<td>1 / / / /</td>
<td>0 / / / /</td>
<td>2 / / / /</td>
<td>0 / / / /</td>
<td>3 / / / /</td>
<td>6 / / / /</td>
</tr>
<tr>
<td>BB 31 / / / /</td>
<td>2 / / / /</td>
<td>0 / / / /</td>
<td>0 / / / /</td>
<td>0 / / / /</td>
<td>4 / / / /</td>
<td>6 / / / /</td>
</tr>
<tr>
<td>BB 32 / / / /</td>
<td>0 / / / /</td>
<td>0 / / / /</td>
<td>2 / / / /</td>
<td>0 / / / /</td>
<td>2 / / / /</td>
<td>4 / / / /</td>
</tr>
<tr>
<td>BB 33 / / / /</td>
<td>3 / / / /</td>
<td>0 / / / /</td>
<td>0 / / / /</td>
<td>0 / / / /</td>
<td>5 / / / /</td>
<td>8 / / / /</td>
</tr>
<tr>
<td>Total / Total / Total / Total</td>
<td>78 / 57 / 57 / 58</td>
<td>0 / 3 / 3 / 3</td>
<td>45 / 26 / 26 / 26</td>
<td>0 / 2 / 2 / 2</td>
<td>93 / 78 / 78 / 80</td>
<td>216 / 166 / 166 / 169</td>
</tr>
</tbody>
</table>

Table B.8: Categories of instructions in pbmsrch_large by basic block
Appendix C

Source Code

C.1 Introduction

We include below all source code for the post-register allocation optimization we developed. Because our optimization was added as a helper to an existing optimization (ARMLoadStoreOpt), we elected to add our source code to the same file (ARMLoadStoreOptimizer.cpp) in a separate namespace. All of our final code was implemented in C++ in this file.

C.2 ARMLoadStoreOptimizer.cpp

The implementation challenges are discussed in the Implementation chapter; here we include the full, and rather lengthy, source code for our optimization. We include both the optimization code and the modifications we made to invoke it from the existing post-register allocation pass, ARMLoadStoreOpt.

The optimization code in Listing C.1 also includes code that logs what the pass is doing, for debugging purposes. Some of this logging code is commented out, but it was left in place to aid in debugging future issues.
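The logging convention used in the listing can be illustrated in isolation. The following is a minimal sketch, not the thesis code: it assumes a plain `std::ostream*` in place of LLVM's `raw_os_ostream`, and the names `SketchPass`, `visit`, and `runWithLogging` are hypothetical. As with `errLog` in Listing C.1, a null pointer disables logging.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for a pass that logs only when a sink is attached.
class SketchPass {
public:
    // Non-owning pointer to the log sink; NULL means logging is disabled.
    std::ostream* errLog;

    SketchPass() : errLog(NULL), visited(0) {}

    // Pretend to process one instruction, logging it if a sink is attached.
    // Returns the number of instructions visited so far.
    int visit(const std::string& instr) {
        assert(!instr.empty() && "instruction text expected");
        if (errLog) {
            *errLog << "  visiting: " << instr << "\n";
            errLog->flush();
        }
        return ++visited;
    }

private:
    int visited;
};

// Usage sketch: capture the log of a short run in a string.
std::string runWithLogging() {
    std::ostringstream sink;
    SketchPass pass;
    pass.errLog = &sink;  // enable logging for this run
    pass.visit("LDR r0, [r1]");
    pass.visit("LDR r2, [r1, #4]");
    return sink.str();
}
```

Guarding every write with `if (errLog)` keeps the disabled case cheap and lets logging be switched on by changing a single assignment, which appears to be the intent of the commented-out line in the `MagnetPass` constructor.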

### Listing C.1: Optimization Code

```cpp
namespace {
  using namespace std;

  class MemOpRecord {
    public:
      MachineInstr* OpLocation;
      MachineInstr* LowerBound;
      MachineInstr* UpperBound;

      MemOpRecord() : OpLocation(NULL), LowerBound(NULL), UpperBound (NULL) {};
  };

  class RegisterDependencies {
    public:
      vector<int> use_dep;
      vector<int> mod_dep;
  };

  struct ClusterPoint {
    MachineInstr* insertAfter;
    vector<MachineInstr*> instructionsToGather;

    ClusterPoint() : insertAfter(NULL) {}
  };
```

class MagnetPass {
    vector<MemOpRecord> AllMemOps;
    vector<RegisterDependencies> RegDeps;
    vector<int> MemDeps;
    unsigned numRegs;

    void clearAllDeps();
    int getAllMemOpsIndexOf(MachineInstr* MI);
    vector<int> getRegsUsed(MachineInstr* MI);
    vector<int> getRegsModified(MachineInstr* MI);
    void endLowerBoundUsingRegsUsed(MachineInstr* MI);
    void endLowerBoundUsingRegsModified(MachineInstr* MI);
    void endUpperBoundUsingRegsUsed(MachineInstr* MI);
    void endUpperBoundUsingRegsModified(MachineInstr* MI);
    void endLowerBoundUsingMem(MachineInstr* MI);
    void endUpperBoundUsingMem(MachineInstr* MI);
    ClusterPoint getBestRangeOverlap(MachineBasicBlock &MBB);
    bool allMIsContiguous(MachineBasicBlock &MBB, const TargetMachine &TM, ClusterPoint bestCluster);
    void gatherAtBestRangeOverlap(MachineBasicBlock &MBB, const TargetMachine &TM, ClusterPoint bestCluster);
    void printBB(MachineBasicBlock &MBB, const TargetMachine &TM, string s);
    void findLowerBounds(MachineBasicBlock &MBB, const TargetMachine &TM);
    void findUpperBounds(MachineBasicBlock &MBB, const TargetMachine &TM);
    void initialize(MachineBasicBlock &MBB, const TargetMachine &TM);
    void cleanUp();

public:
    raw_os_ostream* errLog;

    MagnetPass() {
        //errLog = new raw_os_ostream(cerr); // Log progress to std::cerr
        errLog = NULL; // Disable logging
    }

    ~MagnetPass() {
        if (errLog)
            delete errLog;
    }

    void runOptimization(MachineBasicBlock &MBB, const TargetMachine &TM);
};

class BBPrinter {
public:
    raw_os_ostream* errLog;

    BBPrinter(bool before) {
        if (before) {
            errLog = new raw_os_ostream(cout);
        } else {
            errLog = new raw_os_ostream(cerr);
        }
    }

    void printBB(MachineBasicBlock &MBB, const TargetMachine &TM);
    void printInstr(MachineBasicBlock &MBB, const TargetMachine &TM, int index);

    ~BBPrinter() {
        delete errLog;
    }
};

void BBPrinter::printBB(MachineBasicBlock &MBB, const
    TargetMachine &TM) {
    MachineBasicBlock::iterator MBBI = MBB.begin(), E = MBB.end();
    for (; MBBI != E; ++MBBI) {
        MachineInstr *MI = MBBI;
        *errLog << " " << MI << " : ";
        MI->print(*errLog, &TM);
    }
    *errLog << "\n";
    errLog->flush();
}

void BBPrinter::printInstr(MachineBasicBlock &MBB, const TargetMachine &TM, int index) {
    MachineBasicBlock::iterator MBBI = MBB.begin(), E = MBB.end();
    int i = 0;
    for (; MBBI != E; ++MBBI) {
        if (i == index) {
            MachineInstr *MI = MBBI;
            *errLog << " " << MI << ": ";
            MI->print(*errLog, &TM);
        }
        ++i;
    }
    if (i >= index) {
        // *errLog << "\n";
        errLog->flush();
    }
}

unsigned getBaseReg(MachineInstr* MI) {
    // Pulled from FixInvalidRegPairOp()
    const MachineOperand &BaseOp = MI->getOperand(2);
    return BaseOp.getReg();
}

int getRegNum(unsigned RegEnum) {
    using namespace ARM;
    switch (RegEnum) {
    case R0: return 0;
    case R1: return 1;
    case R2: return 2;
    case R3: return 3;
    case R4: return 4;
    case R5: return 5;
    case R6: return 6;
    case R7: return 7;
    case R8: return 8;
    case R9: return 9;
    case R10: return 10;
    case R11: return 11;
    case R12: return 12;
    case SP: return 13;
    case LR: return 14;
    case PC: return 15;
    default: return -1;
    }
}

// Borrowing later definitions by LLVM
static bool isT2i32Load2(unsigned Opc) {
    return Opc == ARM::t2LDRi12 || Opc == ARM::t2LDRi8;
}
static bool isi32Load2(unsigned Opc) {
    return Opc == ARM::LDR || isT2i32Load2(Opc);
}

bool isLoad(MachineInstr* MI) {
    int Opcode = MI->getOpcode();
    return isi32Load2(Opcode) || Opcode == ARM::VLDRS || Opcode == ARM::VLDRD;
}

static bool isT2i32Store2(unsigned Opc) {
    return Opc == ARM::t2STRi12 || Opc == ARM::t2STRi8;
}

static bool isi32Store2(unsigned Opc) {
    return Opc == ARM::STR || isT2i32Store2(Opc);
}

bool isStore(MachineInstr* MI) {
    int Opcode = MI->getOpcode();
    return isi32Store2(Opcode) || Opcode == ARM::VSTRS || Opcode == ARM::VSTRD;
}

// Defined later by LLVM
// Only detects memory operations that this optimization CAN ACT UPON
static bool isMemoryOp(const MachineInstr *MI);

int MagnetPass::getAllMemOpsIndexOf(MachineInstr* MI) {
    for (int i = 0; i < (int)AllMemOps.size(); ++i) {
        if (AllMemOps[i].OpLocation == MI)
            return i;
    }
    return -1;
}
vector<int> MagnetPass::getRegsModified(MachineInstr* MI) {
    vector<int> regs;
    const TargetInstrDesc &TID = MI->getDesc();
    // If this is a branch, ALL registers are "modified" (LowerBounds and UpperBounds must end)
    if (TID.isBranch() || TID.isTerminator()) {
        for (unsigned i = 0; i < 16; ++i)
            regs.push_back(i);
    } else {
        for (unsigned i = 0, e = MI->getNumOperands(); i != e; ++i) {
            MachineOperand &MO = MI->getOperand(i);
            if (MO.isReg() && MO.isDef()) {
                // Hack because getRegisterNumbering crashes
                const int rnum = getRegNum(MO.getReg());
                // If register is one of the 16 addressable registers
                if (rnum >= 0) {
                    if (find(regs.begin(), regs.end(), rnum) == regs.end()) {
                        regs.push_back(rnum);
                    }
                }
            }
        }
    }
    // Return on every path, including the branch/terminator case
    return regs;
}

vector<int> MagnetPass::getRegsUsed(MachineInstr* MI) {
    vector<int> regs;
    const TargetInstrDesc &TID = MI->getDesc();
    // If this is a branch, ALL registers are "used" (LowerBounds and UpperBounds must end)
    if (TID.isBranch() || TID.isTerminator()) {
        for (unsigned i = 0; i < 16; ++i)
            regs.push_back(i);
    } else {
        for (unsigned i = 0, e = MI->getNumOperands(); i != e; ++i) {
            MachineOperand &MO = MI->getOperand(i);
            if (MO.isReg()) {
                // Hack because getRegisterNumbering crashes
                const int rnum = getRegNum(MO.getReg());
                // If register is one of the 16 addressable registers
                if (rnum >= 0) {
                    if (find(regs.begin(), regs.end(), rnum) == regs.end()) {
                        regs.push_back(rnum);
                    }
                }
            }
        }
    }
    return regs;
}

void MagnetPass::endLowerBoundUsingRegsUsed(MachineInstr* MI) {
    vector<int> regs = getRegsUsed(MI);
    for (vector<int>::iterator it = regs.begin(); it != regs.end(); ++it) {
        /*
        if (errLog) {
            *errLog << " Register used: " << *it;
            *errLog << ", with " << RegDeps[*it].use_dep.size() << " dependencies\n";
            errLog->flush();
        }
        */

        // Set all NULL LowerBounds with this reg to point to this instr
        vector<int>::iterator it2 = RegDeps[*it].use_dep.begin(),
            E = RegDeps[*it].use_dep.end();
        for (; it2 != E; ++it2) {
            if (AllMemOps[*it2].LowerBound == NULL) {
                AllMemOps[*it2].LowerBound = MI;
                /*
                if (errLog) {
                    *errLog << " Setting a LowerBound\n";
                    errLog->flush();
                }
                */
            }
        }
    }
}

void MagnetPass::endLowerBoundUsingRegsModified(MachineInstr* MI) {
    vector<int> regs = getRegsModified(MI);
    for (vector<int>::iterator reg_modified = regs.begin();
         reg_modified != regs.end(); ++reg_modified) {
        const int rnum = *reg_modified;
        /*
        if (errLog) {
            *errLog << " Register defined: " << rnum;
            *errLog << ", with " << RegDeps[rnum].mod_dep.size() << " dependencies\n";
            errLog->flush();
        }
        */
        vector<int>::iterator dependent_reg = RegDeps[rnum].mod_dep.begin(),
            E = RegDeps[rnum].mod_dep.end();
        for (; dependent_reg != E; ++dependent_reg) {
            if (AllMemOps[*dependent_reg].LowerBound == NULL) {
                AllMemOps[*dependent_reg].LowerBound = MI;
                /*
                if (errLog) {
                    *errLog << " Setting " << AllMemOps[*dependent_reg].OpLocation << " LowerBound to " << MI << "\n";
                    errLog->flush();
                }
                */
            }
        }
    }
}

void MagnetPass::endUpperBoundUsingRegsUsed(MachineInstr* MI) {
  vector<int> regs = getRegsUsed(MI);
  for (vector<int>::iterator reg_used = regs.begin(); reg_used !=
       regs.end(); ++reg_used) {
    const int rnum = *reg_used;
    /*
     if (errLog) {
       *errLog << " Register used: " << rnum;
       *errLog << ", with " << RegDeps[rnum].use_dep.size() << " dependencies\n";
       errLog->flush();
     }
     */
    // Set all NULL UpperBounds with this reg to point to this instr
    vector<int>::iterator dependent_reg = RegDeps[rnum].use_dep.begin(),
      E = RegDeps[rnum].use_dep.end();
    for (; dependent_reg != E; ++dependent_reg) {
      if (AllMemOps[*dependent_reg].UpperBound == NULL) {
        AllMemOps[*dependent_reg].UpperBound = MI;
        /*
         if (errLog) {
           *errLog << " Setting a UpperBound\n";
           errLog->flush();
         }
         */
      }
    }
  }
}
void MagnetPass::endUpperBoundUsingRegsModified(MachineInstr* MI) {
    vector<int> regs = getRegsModified(MI);
    for (vector<int>::iterator it = regs.begin(); it != regs.end(); ++it) {
        const int rnum = *it;
        /*
         * if (errLog) {
         *     *errLog << " Register defined: " << rnum;
         *     *errLog << ", with " << RegDeps[rnum].mod_dep.size() << " dependencies\n";
         *     errLog->flush();
         * }
         */
        vector<int>::iterator it2 = RegDeps[rnum].mod_dep.begin(),
            E = RegDeps[rnum].mod_dep.end();
        for (; it2 != E; ++it2) {
            if (AllMemOps[*it2].UpperBound == NULL) {
                AllMemOps[*it2].UpperBound = MI;
                /*
                 * if (errLog) {
                 *     *errLog << " Setting a UpperBound\n";
                 *     errLog->flush();
                 * }
                 */
            }
        }
    }
}
void MagnetPass::endLowerBoundUsingMem(MachineInstr* MI) {
    // Set all NULL LowerBounds with this reg to point to this instr
    vector<int>::iterator it2 = MemDeps.begin(), E = MemDeps.end();
    if (isLoad(MI)) {
        for (; it2 != E; ++it2) {
            if (AllMemOps[*it2].LowerBound == NULL && isStore(AllMemOps[*it2].OpLocation)) {
                AllMemOps[*it2].LowerBound = MI;
                if (errLog) {
                    *errLog << " Setting a LowerBound using MEM at " <<
                        AllMemOps[*it2].OpLocation << " to " << MI << "\n";
                    errLog->flush();
                }
            }
        }
    } else if (isStore(MI)) {
        for (; it2 != E; ++it2) {
            // Conservative approach to the aliasing problem means
            // we cannot move stores beyond stores, lest two
            // base+offsets point to the same memory location
            if (AllMemOps[*it2].LowerBound == NULL && (isLoad(AllMemOps[*it2].OpLocation) ||
                isStore(AllMemOps[*it2].OpLocation))) {
                AllMemOps[*it2].LowerBound = MI;
                if (errLog) {
                    *errLog << " Setting a LowerBound using MEM at " <<
                        AllMemOps[*it2].OpLocation << " to " << MI << "\n";
                    errLog->flush();
                }
            }
        }
    }
}

void MagnetPass::endUpperBoundUsingMem(MachineInstr* MI) {
    // Set all NULL UpperBounds with this reg to point to this instr
    vector<int>::iterator it2 = MemDeps.begin(),
    E = MemDeps.end();
    if (isLoad(MI)) {
        for (; it2 != E; ++it2) {
            if (AllMemOps[*it2].UpperBound == NULL && isStore(AllMemOps
                [*it2].OpLocation)) {
                AllMemOps[*it2].UpperBound = MI;
                if (errLog) {
                    *errLog << " Setting a UpperBound using MEM at " <<
                        AllMemOps[*it2].OpLocation << " to " << MI << "\n";
                    errLog->flush();
                }
            }
        }
    } else if (isStore(MI)) {
        for (; it2 != E; ++it2) {
            // Conservative approach to the aliasing problem means
            // we cannot move stores beyond stores, lest two
            // base+offsets point to the same memory location
            if (AllMemOps[*it2].UpperBound == NULL && (isLoad(AllMemOps[*it2].OpLocation) ||
                isStore(AllMemOps[*it2].OpLocation))) {
                AllMemOps[*it2].UpperBound = MI;
                if (errLog) {
                    *errLog << " Setting a UpperBound using MEM at " <<
                        AllMemOps[*it2].OpLocation << " to " << MI << "\n";
                    errLog->flush();
                }
            }
        }
    }
}

void MagnetPass::printBB(MachineBasicBlock &MBB, const TargetMachine &TM, string s) {
    *errLog << s << "\n";
    MachineBasicBlock::iterator MBBI = MBB.begin(), E = MBB.end();
    for (; MBBI != E; ++MBBI) {
        MachineInstr *MI = MBBI;
        *errLog << " " << MI << ": ";
        MI->print(*errLog, &TM);
    }
    errLog->flush();
}

void MagnetPass::clearAllDeps() {
    for (unsigned i = 0; i < RegDeps.size(); ++i) {
        RegDeps[i].use_dep.clear();
        RegDeps[i].mod_dep.clear();
    }
    MemDeps.clear();
}

void MagnetPass::findLowerBounds(MachineBasicBlock &MBB, const TargetMachine &TM) {
    if (errLog) {
        *errLog << "findLowerBounds: Starting Analysis\n";
        errLog->flush();
    }
    clearAllDeps();

    // Iterate through basic block
    MachineBasicBlock::iterator MBBI = MBB.begin(), E = MBB.end();
    for (; MBBI != E; ++MBBI) {
        MachineInstr *MI = MBBI;

        if (errLog) {
            *errLog << " " << MI << ": ";
            MI->print(*errLog, &TM);
            errLog->flush();
        }
        endLowerBoundUsingRegsModified(MI);
        endLowerBoundUsingRegsUsed(MI);
        endLowerBoundUsingMem(MI);

        // FIXME: isMemoryOp() apparently doesn't catch STR r2, [r1,
        if (isMemoryOp(MI)) {
            /*
            if (errLog) {
                for (unsigned i = 0, e = MI->getNumOperands(); i != e; ++i) {
                    MachineOperand &MO = MI->getOperand(i);
                    *errLog << " " << MO << " " << MO.getType() << "\n";
                    // *errLog << " " << MO.getReg() << " " << MO.getType() << "\n";
                }
                errLog->flush();
            }
            */
            int uMOR = getAllMemOpsIndexOf(MI);

            // If this MemOp has already been removed from AllMemOps, skip
            if (uMOR < 0)
                continue;

            MemDeps.push_back(uMOR);

            vector<int> regsUsed = getRegsUsed(MI);
            for (vector<int>::iterator it = regsUsed.begin(); it != regsUsed.end(); ++it) {
                RegDeps[*it].mod_dep.push_back(uMOR);
                /*
                if (errLog) {
                    *errLog << " Adding mod_dep using reg " << *it << " to this MI (" << AllMemOps[RegDeps[*it].mod_dep.size() - 1].OpLocation << ")\n";
                    errLog->flush();
                }
                */
            }

            // Only loads create use dependencies
            if (isLoad(MI)) {
                vector<int> regsModified = getRegsModified(MI);
                for (vector<int>::iterator it = regsModified.begin(); it != regsModified.end(); ++it)
                    RegDeps[*it].use_dep.push_back(uMOR);
            }
        }
    }

    if (errLog) {
        *errLog << "findLowerBounds: Finished\n";
        *errLog << "findLowerBounds: Current AllMemOps (" << AllMemOps.size() << ")\n";
        for (unsigned i = 0; i < AllMemOps.size(); ++i) {
            MemOpRecord a = AllMemOps[i];
            *errLog << " " << a.OpLocation << " with UpperBound = " << a.UpperBound << " and LowerBound = " << a.LowerBound << "\n";
            errLog->flush();
        }
        /*
        *errLog << "findLowerBounds: Current RegDeps:\n";
        for (unsigned i = 0; i < numRegs; ++i) {
            if (RegDeps[i].use_dep.size() > 0) {
                *errLog << " use_dep[" << i << "]: ";
                for (vector<int>::iterator it = RegDeps[i].use_dep.begin();
                     it != RegDeps[i].use_dep.end(); ++it) {
                    *errLog << AllMemOps[*it].OpLocation << ", ";
                }
                *errLog << "\n";
                errLog->flush();
            }
            if (RegDeps[i].mod_dep.size() > 0) {
                *errLog << " mod_dep[" << i << "]: ";
                for (vector<int>::iterator it = RegDeps[i].mod_dep.begin();
                     it != RegDeps[i].mod_dep.end(); ++it) {
                    *errLog << AllMemOps[*it].OpLocation << ", ";
                }
                *errLog << "\n";
                errLog->flush();
            }
        }
        */
    }
}

void MagnetPass::findUpperBounds(MachineBasicBlock &MBB, const TargetMachine &TM) {
    if (errLog) {
        *errLog << "findUpperBounds: Starting Analysis\n";
        errLog->flush();
    }
    clearAllDeps();

    // Iterate through basic block in reverse order
    MachineBasicBlock::iterator MBBRI = MBB.end(), E = MBB.begin();
    for (; MBBRI != E; MBBRI--) {
        // MBBRI points one past the instruction examined in this iteration
        MachineBasicBlock::iterator MBBRI2 = MBBRI;
        --MBBRI2;
        MachineInstr *MI = MBBRI2;
        /*
        if (errLog) {
          *errLog << " " << MI << ": ";
          MI->print(*errLog, &TM);
          errLog->flush();
        }
        */
        endUpperBoundUsingRegsModified(MI);
        endUpperBoundUsingRegsUsed(MI);
        endUpperBoundUsingMem(MI);

        if (isMemoryOp(MI)) {
            // Locate MemOpRecord corresponding to this instr
            int uMOR = getAllMemOpsIndexOf(MI);
            // If this MemOp has already been removed from AllMemOps, skip
            if (uMOR < 0)
                continue;

            MemDeps.push_back(uMOR);

            vector<int> regsUsed = getRegsUsed(MI);
            for (vector<int>::iterator it = regsUsed.begin(); it != regsUsed.end(); ++it) {
                RegDeps[*it].mod_dep.push_back(uMOR);
            }

            // Only loads create use dependencies
            if (isLoad(MI)) {
                vector<int> regsModified = getRegsModified(MI);
                for (vector<int>::iterator it = regsModified.begin(); it != regsModified.end(); ++it)
                    RegDeps[*it].use_dep.push_back(uMOR);
            }
        }
    }

    if (errLog) {
        *errLog << "findUpperBounds: Finished\n";
        *errLog << "findUpperBounds: Current AllMemOps (" << AllMemOps.size() << ")\n";
        for (unsigned i = 0; i < AllMemOps.size(); ++i) {
            MemOpRecord a = AllMemOps[i];
            *errLog << " " << a.OpLocation << " with UpperBound = " << a.UpperBound << " and LowerBound = " << a.LowerBound << "\n";
            errLog->flush();
        }
        /*
        *errLog << "findUpperBounds: Current RegDeps:\n";
        for (unsigned i = 0; i < numRegs; ++i) {
          if (RegDeps[i].use_dep.size() > 0) {
            *errLog << " use_dep[" << i << "]: ";
            for (vector<int>::iterator it = RegDeps[i].use_dep.begin();
                 it != RegDeps[i].use_dep.end(); ++it) {
              *errLog << *it << ", ";
            }
            *errLog << "\n";
            errLog->flush();
          }
          if (RegDeps[i].mod_dep.size() > 0) {
            *errLog << " mod_dep[" << i << "]: ";
            for (vector<int>::iterator it = RegDeps[i].mod_dep.begin();
                 it != RegDeps[i].mod_dep.end(); ++it) {
              *errLog << *it << ", ";
            }
            *errLog << "\n";
            errLog->flush();
          }
        }
        */
        errLog->flush();
    }
}

ClusterPoint MagnetPass::getBestRangeOverlap(MachineBasicBlock & MBB) {
  if (errLog) {
    *errLog << "getBestRangeOverlap: Starting Analysis (AllMemOps has " << AllMemOps.size() << " entries)\n";
    errLog->flush();
  }

  // If any loads remain in AllMemOps, evaluate loads
  bool lookAtLoads = false;
  unsigned baseReg = 1337;
  for (vector<MemOpRecord>::iterator it = AllMemOps.begin(), e = AllMemOps.end(); it != e; ++it) {
    MachineInstr *MI = it->OpLocation;
    if (isLoad(MI)) {
      lookAtLoads = true;
      baseReg = getBaseReg(MI);
      break;
    }
  }
  // If we are evaluating loads, we already found the baseReg in the search loop
  if (!lookAtLoads)
    baseReg = getBaseReg(AllMemOps[0].OpLocation);

  if (errLog) {
    *errLog << " lookAtLoads set to " << lookAtLoads << " and baseReg set to " << baseReg << "\n";
    errLog->flush();
  }

  // Adds all initial loads or stores that can move to top of basic block
  ClusterPoint currentCluster, bestCluster;
  for (vector<MemOpRecord>::iterator it = AllMemOps.begin(), e = AllMemOps.end(); it != e; ++it) {
    MachineInstr *MI = it->OpLocation;
    // If lookAtLoads, we only match loads, else we only match stores
    if ((lookAtLoads && isLoad(MI)) || (!lookAtLoads && !isLoad(MI))) {
      // If the baseReg is what we want and UpperBound is NULL, add it to currentCluster
      if ((getBaseReg(MI) == baseReg) && (it->UpperBound == NULL))
        currentCluster.instructionsToGather.push_back(it->OpLocation);
    }
  }
  bestCluster = currentCluster;

  // FIXME: Currently loads AND stores BOTH move as far DOWN as they can.
  // Make stores move UP!
  // Look through BB for concurrent live Ranges of Movement at every BB instruction
  for (MachineBasicBlock::iterator MBBI = MBB.begin(), E = MBB.end(); MBBI != E; ++MBBI) {
    MachineInstr *MI = MBBI;

    currentCluster.insertAfter = MI;

    // Check to see which AllMemOps with proper base reg have
    // live Ranges of Movement at this instruction
    for (vector<MemOpRecord>::iterator it = AllMemOps.begin(), e = AllMemOps.end(); it != e; ++it) {
      MachineInstr *AMOMI = it->OpLocation;

      // If lookAtLoads, we only match loads, else we only match stores
      if ((lookAtLoads && isLoad(AMOMI)) || (!lookAtLoads && !isLoad(AMOMI))) {
        // If the baseReg is what we want
        if (getBaseReg(AMOMI) == baseReg) {
          // Add to set of live Ranges of Movement
          if (it->UpperBound == MI)
            currentCluster.instructionsToGather.push_back(it->OpLocation);

          // Remove from set of live Ranges of Movement
          if (it->LowerBound == MI) {
            // Erase-remove idiom (as in allMIsContiguous): drop the memory op
            // whose Range of Movement ends at MI, as the Python prototype does
            vector<MachineInstr*>::iterator temp = remove(
                currentCluster.instructionsToGather.begin(),
                currentCluster.instructionsToGather.end(), AMOMI);
            currentCluster.instructionsToGather.erase(
                temp, currentCluster.instructionsToGather.end());
          }
        }
      }
    }

    // Keep track of where the largest cluster of live Ranges of Movement are
    if (currentCluster.instructionsToGather.size() >= bestCluster.instructionsToGather.size())
      bestCluster = currentCluster;
  }

  if (errLog) {
    *errLog << " bestCluster created with " << bestCluster.instructionsToGather.size() << " entries\n";
    errLog->flush();
  }

  // Remove from AllMemOps all bestCluster.instructionsToGather, since we're going to move them
  for (vector<MachineInstr*>::iterator it = bestCluster.instructionsToGather.begin(), e = bestCluster.instructionsToGather.end(); it != e; ++it) {
    for (vector<MemOpRecord>::iterator toRemove = AllMemOps.begin(); toRemove != AllMemOps.end(); ++toRemove) {
      if (toRemove->OpLocation == *it) {
        // erase() invalidates toRemove, so leave the inner loop immediately
        AllMemOps.erase(toRemove);
        break;
      }
    }
  }

  // If bestCluster.insertAfter is in bestCluster.instructionsToGather,
  // remove it since it's already been "moved"
  for (vector<MachineInstr*>::iterator it = bestCluster.instructionsToGather.begin(), e = bestCluster.instructionsToGather.end(); it != e; ++it) {
    if (*it == bestCluster.insertAfter) {
      bestCluster.instructionsToGather.erase(it);
      break;
    }
  }

  if (errLog) {
    *errLog << "getBestRangeOverlap: Finished Analysis (AllMemOps has " << AllMemOps.size() << " entries)\n";
    errLog->flush();
  }
  return bestCluster;
}

bool MagnetPass::allMIsContiguous(MachineBasicBlock &MBB, const TargetMachine &TM, ClusterPoint bestCluster) {
    if (errLog) {
        *errLog << "allMIsContiguous: bestCluster.instructionsToGather Before Any Operations\n";
        for (vector<MachineInstr*>::iterator it = bestCluster.instructionsToGather.begin(), e = bestCluster.instructionsToGather.end(); it != e; ++it) {
            *errLog << " " << *it << ": ";
            (*it)->print(*errLog, &TM);
        }
        *errLog << "allMIsContiguous: bestCluster.insertAfter Before Any Operations\n";
        *errLog << " " << bestCluster.insertAfter << ": ";
        bestCluster.insertAfter->print(*errLog, &TM);
        *errLog << "allMIsContiguous: Starting contiguous check\n";
        errLog->flush();
    }

    // bestCluster is passed by value, so these changes stay local to the check
    bestCluster.instructionsToGather.push_back(bestCluster.insertAfter);

    // Check for contiguousness of all instructionsToGather
    bool started = false;
    for (MachineBasicBlock::iterator MBBI = MBB.begin(); MBBI != MBB.end(); ++MBBI) {
        MachineInstr *MI = MBBI;
        if (bestCluster.instructionsToGather.end() !=
            find(bestCluster.instructionsToGather.begin(), bestCluster.instructionsToGather.end(), MI)) {
            if (!started)
                started = true;
            if (errLog) {
                *errLog << " MI found, removing " << MI << " - there were " << bestCluster.instructionsToGather.size();
                errLog->flush();
            }
            vector<MachineInstr*>::iterator temp = remove(bestCluster.instructionsToGather.begin(), bestCluster.instructionsToGather.end(), MI);
            bestCluster.instructionsToGather.erase(temp, bestCluster.instructionsToGather.end());
            if (errLog) {
                *errLog << " and are now " << bestCluster.instructionsToGather.size() << " instructionsToGather" << "\n";
                errLog->flush();
            }
        } else {
            if (started) {
                // First non-member after the run began: contiguous only if nothing is left
                if (errLog) {
                    *errLog << " MI not found after started, " << bestCluster.instructionsToGather.size() << " instructionsToGather are left\n";
                    errLog->flush();
                }
                if (bestCluster.instructionsToGather.size() == 0)
                    return true;
                else
                    return false;
            }
            if (errLog) {
                *errLog << " MI not found, " << bestCluster.instructionsToGather.size() << " instructionsToGather are left\n";
                errLog->flush();
            }
        }
    }
    if (started)
        return true;
    return false;
}

// Clustering is performed by removing all affected memory ops from the BB, removing all instructions after the
// insertion point, appending the memory ops after the insertion point, and finally appending the trailing
// instructions at the end.
void MagnetPass::gatherAtBestRangeOverlap(MachineBasicBlock &MBB, const TargetMachine &TM, ClusterPoint bestCluster) {
    if (errLog) {
        *errLog << "gatherAtBestRangeOverlap: bestCluster.instructionsToGather Before Any Operations\n";
        for (vector<MachineInstr*>::iterator it = bestCluster.instructionsToGather.begin(), e = bestCluster.instructionsToGather.end(); it != e; ++it) {
            *errLog << " " << *it << ": ";
            (*it)->print(*errLog, &TM);
        }

        printBB(MBB, TM, "gatherAtBestRangeOverlap: MBB Before Any Operations");
        *errLog << "gatherAtBestRangeOverlap: Beginning <kill> analysis for " << bestCluster.instructionsToGather.size() << " bestCluster MIs\n";
        errLog->flush();
    }

    // Split instructionsToGather into MIs that must move up and MIs that must move down
    vector<MachineInstr*> moveDown, moveUp;
    bool passedInsertAfter = false;
    for (MachineBasicBlock::iterator MBBI = MBB.begin(); MBBI != MBB.end(); ++MBBI) {
        MachineInstr *MI = MBBI;
        // If MI is in instructionsToGather, insert it into the proper vector
        if (bestCluster.instructionsToGather.end() !=
            find(bestCluster.instructionsToGather.begin(), bestCluster.instructionsToGather.end(), MI)) {
            if (passedInsertAfter)
                moveUp.push_back(MI);
            else
                moveDown.push_back(MI);
        }
        if (MI == bestCluster.insertAfter)
            passedInsertAfter = true;
    }

    // Memory ops moving below another instruction which has a
    // register kill must SUBSUME that kill.
    // Memory ops with a register kill that move above another
    // instruction using that register must
    // ABDICATE that kill to the other instruction.
    enum reg_stats { NONE, TOSUBSUME, SUBSUMED, TOABDICATE, ABDICATED };

    // FIXME: Since numRegs isn't const, use workaround for custom
    // number of regs
    reg_stats regStatus[16];
    for (unsigned i = 0; i < 16; ++i)
        regStatus[i] = NONE;

    // START ANALYZING moveDown MIs for kills they may need to SUBSUME
    // Any register used by any MI in moveDown could be:
    // 1. Possibly killed in an instruction it moves after (NONE
    //    -> TOSUBSUME), in which case it would need to SUBSUME
    // 2. Killed in an instruction it moves after, in which case
    //    it must SUBSUME (NONE/TOSUBSUME -> SUBSUMED)

    // Iterate through BB even if no MIs need to be moved down, to
    // set removeStartingHere
    MachineBasicBlock::iterator removeStartingHere;
    for (MachineBasicBlock::iterator MBBI = MBB.begin(); MBBI != MBB.end(); ++MBBI) {
        MachineInstr *MI = MBBI;

        bool moveDownContainsMI = false;
        // If MI is in moveDown, prepare to SUBSUME later kills to MI's regs
        if (moveDown.end() != find(moveDown.begin(), moveDown.end(), MI))
            moveDownContainsMI = true;

        // Examine all regs used by MI
        for (unsigned i = 0, e = MI->getNumOperands(); i != e; ++i) {
            MachineOperand &MO = MI->getOperand(i);
            if (MO.isReg()) {
                const int rnum = getRegNum(MO.getReg());
                // If reg is one of the 16 addressable regs
                if (rnum >= 0) {
                    // Prepare to SUBSUME all reg kills this MI uses
                    if (moveDownContainsMI) {
                        if (MO.isKill()) {
                            regStatus[rnum] = SUBSUMED;
                            MO.setIsKill(false);
                        } else {
                            if (regStatus[rnum] != SUBSUMED)
                                regStatus[rnum] = TOSUBSUME;
                        }
                    }
                    // Update whether moveDown MIs must SUBSUME this reg kill
                    else {
                        if (MO.isKill() && (regStatus[rnum] != NONE)) {
                            regStatus[rnum] = SUBSUMED;
                            MO.setIsKill(false);
                        }
                    }
                }
            }
        }

        // Once we hit insertAfter, set removeStartingHere and leave
        // MBB loop
        if (MI == bestCluster.insertAfter) {
            removeStartingHere = MBBI;
            ++removeStartingHere;
            break;
        }
    } // End of MBBI iteration

    // Remove moveDown MIs from MBB
    for (vector<MachineInstr*>::iterator it = moveDown.begin(), e = moveDown.end(); it != e; ++it)
        MBB.remove(*it);

    // Update moveDown kill flags based on regStatus values
    for (unsigned i = 0; i < 16; ++i) {
        if (regStatus[i] == SUBSUMED) {
            bool stopUpdating = false;
            // Find LAST moveDown MI that uses i reg and add a kill flag
            for (vector<MachineInstr*>::reverse_iterator rit = moveDown.rbegin(), re = moveDown.rend(); rit != re; ++rit) {
                // Examine all regs used by MI; the operand index j must stay
                // distinct from the register number i
                for (unsigned j = 0, n = (*rit)->getNumOperands(); j != n; ++j) {
                    MachineOperand &MO = (*rit)->getOperand(j);
                    if (MO.isReg()) {
                        // If reg is one of the 16 addressable regs and is i
                        if ((int)i == getRegNum(MO.getReg())) {
                            (*rit)->addRegisterKilled(MO.getReg(), TM.getRegisterInfo());
                            stopUpdating = true;
                        }
                    }
                }
                // We only want to update last moveDown MI using i
                if (stopUpdating)
                    break;
            }
        }
    }

    // DONE ANALYZING moveDown MIs, START ANALYZING moveUp MIs for
    // register kills they may need to ABDICATE
    // Any register killed by any MI in moveUp could move behind
    // an instruction using that register,
    // in which case it must ABDICATE the kill to that instruction

    // Reset regStatus values
    for (unsigned i = 0; i < 16; ++i)
        regStatus[i] = NONE;

    // If no MIs need to be moved UP, don't iterate through BB backwards
    if (moveUp.size() > 0) {
        for (MachineBasicBlock::reverse_iterator rMBBI = MBB.rbegin(); rMBBI != MBB.rend(); ++rMBBI) {
            // Once we hit insertAfter, leave MBB loop
            if (bestCluster.insertAfter == &(*rMBBI))
                break;

            bool moveUpContainsMI = false;
            // If MI is in moveUp, prepare to ABDICATE later kills to MI's regs
            if (moveUp.end() != find(moveUp.begin(), moveUp.end(), &(*rMBBI)))
                moveUpContainsMI = true;

            // Examine all regs used by MI
            for (unsigned i = 0, e = rMBBI->getNumOperands(); i != e; ++i) {
                MachineOperand &MO = rMBBI->getOperand(i);
                if (MO.isReg()) {
                    const int rnum = getRegNum(MO.getReg());

                    // If reg is one of the 16 addressable regs
                    if (rnum >= 0) {
                        // Prepare to ABDICATE all regs this MI uses
                        if (moveUpContainsMI) {
                            if (MO.isKill()) {
                                if (regStatus[rnum] != ABDICATED)
                                    regStatus[rnum] = TOABDICATE;
                            } else {
                                if (regStatus[rnum] == TOABDICATE) {
                                    regStatus[rnum] = ABDICATED;
                                    rMBBI->addRegisterKilled(MO.getReg(), TM.getRegisterInfo());
                                }
                            }
                        }
                        // Update MI and mark regStatus ABDICATED if we are waiting to abdicate
                        else {
                            if (regStatus[rnum] == TOABDICATE) {
                                regStatus[rnum] = ABDICATED;
                                rMBBI->addRegisterKilled(MO.getReg(), TM.getRegisterInfo());
                            }
                        }
                    }
                }
            }
        } // End of MBB reverse iteration
    }

// Remove moveUp MIs from MBB
for (vector<MachineInstr*>::iterator it = moveUp.begin(), e =
     moveUp.end();
    it != e; ++it)
    MBB.remove(*it);

    // Remove moveUp kill flags based on regStatus values
    for (unsigned i = 0; i < 16; ++i) {
        if (regStatus[i] == ABDICATED) {
            // Find all moveUp MIs that use i reg and remove the kill flag
            for (vector<MachineInstr*>::iterator it = moveUp.begin(), e = moveUp.end(); it != e; ++it) {
                MachineInstr *MI = *it;
                // Examine all regs used by MI; the operand index j must stay
                // distinct from the register number i
                for (unsigned j = 0, n = MI->getNumOperands(); j != n; ++j) {
                    MachineOperand &MO = MI->getOperand(j);
                    if (MO.isReg()) {
                        // If i reg is ABDICATED and moveUp MI kills reg
                        if ((int)i == getRegNum(MO.getReg())) {
                            if (MO.isKill())
                                MO.setIsKill(false);
                        }
                    }
                }
            }
        }
    }

// DONE ANALYZING moveUp MIs
    if (errLog) {
        *errLog << "gatherAtBestRangeOverlap: Finished <kill> analysis\n";
        printBB(MBB, TM, "gatherAtBestRangeOverlap: MBB After MemOp Removal");
    }

// Set temp to hold all MBB non-MemOps after insertAfter
vector<MachineInstr*> temp;
bool saveTrailing = false;
for (MachineBasicBlock::iterator MBBI = MBB.begin(), E = MBB.end(); MBBI != E; ++MBBI) {
  MachineInstr *MI = MBBI;
  if (saveTrailing)
    temp.push_back(MI);
  if (MI == bestCluster.insertAfter)
    saveTrailing = true;
}

// Remove non-MemOps after insertAfter from MBB
for (vector<MachineInstr*>::iterator it = temp.begin(); it != temp.end(); ++it)
  MBB.remove(*it);

// Append bestCluster.instructionsToGather MemOps after insertAfter
for (vector<MachineInstr*>::iterator it = bestCluster.instructionsToGather.begin(), e = bestCluster.instructionsToGather.end();
  it != e; ++it)
MBB.push_back(*it);

// Append all non-MemOps after insertAfter back onto MBB
for (vector<MachineInstr*>::iterator it = temp.begin(); it !=
    temp.end(); ++it)
    MBB.push_back(*it);

if (errLog)
    printBB(MBB, TM, "gatherAtBestRangeOverlap: MBB After Rearrangement");
}

void MagnetPass::initialize(MachineBasicBlock &MBB, const
    TargetMachine &TM) {
    if (errLog) {
        *errLog << "initialize: Building AllMemOps and RegDeps\n";
        errLog->flush();
    }

    // Setup the RegDeps
    // FIXME: Creates 112 RegisterDependencies instead of 16 - why?
    numRegs = TM.getRegisterInfo()->getNumRegs();
    numRegs = 16;
    for (unsigned i = 0; i < numRegs; ++i) {
        RegisterDependencies d;
        RegDeps.push_back(d);
    }

    // Setup AllMemOps
    for (MachineBasicBlock::iterator MBBI = MBB.begin(), E = MBB.end(); MBBI != E; ++MBBI) {
        MachineInstr *MI = MBBI;
        if (isMemoryOp(MI)) {
            MemOpRecord MOR;
            MOR.OpLocation = MI;
            AllMemOps.push_back(MOR);
        }
    }
}

void MagnetPass::cleanUp() {
    if (errLog) {
        *errLog << "cleanUp: Clearing AllMemOps and RegDeps\n";
        errLog->flush();
    }
    AllMemOps.clear();
    clearAllDeps();
}

void MagnetPass::runOptimization(MachineBasicBlock &MBB, const TargetMachine &TM) {
    initialize(MBB, TM);

    while (AllMemOps.size() > 0) {
        // Nullify all LowerBound and UpperBound values in AllMemOps
        for (vector<MemOpRecord>::iterator it = AllMemOps.begin(), e = AllMemOps.end(); it != e; ++it) {
            it->UpperBound = NULL;
            it->LowerBound = NULL;
        }

        // Regenerate the ranges for the MemOps still remaining in AllMemOps
        findLowerBounds(MBB, TM);
        findUpperBounds(MBB, TM);

        ClusterPoint bestCluster = getBestRangeOverlap(MBB);
        if (!allMIsContiguous(MBB, TM, bestCluster))
            gatherAtBestRangeOverlap(MBB, TM, bestCluster);
    }
    cleanUp();
}
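
The rearrangement that gatherAtBestRangeOverlap performs (remove the clustered memory operations, split the block after the insertion point, then reassemble) can be sketched in a few lines. The following is a minimal Python 3 illustration and not the LLVM implementation: instructions are plain strings, `gather_after` is a hypothetical helper name, and the kill-flag bookkeeping is omitted entirely.

```python
def gather_after(block, to_gather, insert_after):
    """Return a new instruction list with `to_gather` placed right after `insert_after`.

    Hypothetical helper for illustration only; instructions are plain strings.
    """
    if insert_after in to_gather:
        raise ValueError("the insertion point itself must not be gathered")
    # Memory ops to cluster, kept in their original relative order
    gathered = [ins for ins in block if ins in to_gather]
    # Step 1: remove the memory ops from the block
    rest = [ins for ins in block if ins not in to_gather]
    # Step 2: cut the remaining instructions just after the insertion point
    split = rest.index(insert_after) + 1
    # Steps 3 and 4: append the memory ops, then the trailing instructions
    return rest[:split] + gathered + rest[split:]


block = ["ldr r2, [r3]", "add r1, r1, r1", "ldr r0, [r3, #4]", "add r0, r0, r1"]
result = gather_after(block, {"ldr r2, [r3]", "ldr r0, [r3, #4]"}, "add r1, r1, r1")
```

In this example the two loads end up adjacent, directly after the first `add`, which is the arrangement a later load/store-multiple merge would need.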

Listing C.2: Enabling Optimization

/// ARMAllocLoadStoreOpt - Post- register allocation pass the combine
/// load / store instructions to form ldm / stm instructions.
namespace {
    struct ARMLoadStoreOpt : public MachineFunctionPass {
        static char ID;
        MagnetPass MagPass;
        BBPrinter PrintBefore, PrintAfter;
        ARMLoadStoreOpt() : MachineFunctionPass(ID), MagPass(),
            PrintBefore(true), PrintAfter(false) {}
bool ARMLoadStoreOpt::runOnMachineFunction(MachineFunction &Fn) {
    const TargetMachine &TM = Fn.getTarget();
    AFI = Fn.getInfo<ARMFunctionInfo>();
    TII = TM.getInstrInfo();
    TRI = TM.getRegisterInfo();
    RS = new RegScavenger();
    isThumb2 = AFI->isThumb2Function();

    bool Modified = false;
    for (MachineFunction::iterator MFI = Fn.begin(), E = Fn.end();
        MFI != E; ++MFI) {
        MachineBasicBlock &MBB = *MFI;
        MagPass.runOptimization(MBB, TM);
        //PrintBefore.printBB(MBB, TM);
        Modified |= LoadStoreMultipleOpti(MBB, TM);
        Modified |= MergeReturnIntoLDM(MBB);
        //PrintAfter.printBB(MBB, TM);
    }
    delete RS;
    return Modified;
}
C.3 Python Prototype

During development of our algorithm, we found it useful to prototype the optimization in Python. Python's expressiveness and flexibility let us validate our ideas without needing full knowledge of how to develop LLVM optimizations in C++. We include the prototype in this appendix not only for completeness, but also because that same expressiveness may make the underlying algorithm easier to follow here than in the equivalent C++ code of the full implementation.
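
As orientation before the listings, the central idea (each load has a range of positions it may occupy without violating a register dependency) can be sketched compactly. This is a simplified Python 3 illustration under stated assumptions, not the prototype itself, which is written for Python 2: `Instr`, `conflicts`, and `movement_range` are hypothetical names, ranges are expressed as list indices rather than instruction handles, and every store is conservatively treated as a barrier for loads.

```python
from collections import namedtuple

# op: mnemonic, dst: register written (None for stores), srcs: registers read
Instr = namedtuple("Instr", "op dst srcs")


def conflicts(load, other):
    """True if `load` must not be reordered across `other`."""
    base = load.srcs[0]
    return (
        load.dst in other.srcs      # other reads the register the load writes
        or other.dst == load.dst    # other overwrites the loaded register
        or other.dst == base        # other changes the address the load uses
        or other.op == "STR"        # assumption: any store may alias the load
    )


def movement_range(block, k):
    """Inclusive (lowest, highest) index the load at block[k] may occupy."""
    load = block[k]
    lo = k
    while lo > 0 and not conflicts(load, block[lo - 1]):
        lo -= 1
    hi = k
    while hi + 1 < len(block) and not conflicts(load, block[hi + 1]):
        hi += 1
    return lo, hi


# The instruction sequence used by prototype.py
block = [
    Instr("LDR", "r2", ("r3",)),
    Instr("ADD", "r1", ("r1", "r1")),
    Instr("LDR", "r0", ("r3",)),
    Instr("ADD", "r0", ("r0", "r1")),
    Instr("ADD", "r1", ("r0", "r2")),
    Instr("STR", None, ("r1", "r3")),
    Instr("ADD", "r2", ("r1", "r3")),
]
first_load = movement_range(block, 0)
second_load = movement_range(block, 2)
```

Here the first load may sit anywhere in positions 0 to 3 and the second anywhere in positions 0 to 2, so the two ranges overlap and the loads can be made adjacent.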

Listing C.3: prototype.py

#!/usr/bin/python

import instr_ops

ops = []
ops.append(instr_ops.MachineInstr('LDR r2 r3'))
ops.append(instr_ops.MachineInstr('ADD r1 r1 r1'))
ops.append(instr_ops.MachineInstr('LDR r0 r3'))
ops.append(instr_ops.MachineInstr('ADD r0 r0 r1'))
ops.append(instr_ops.MachineInstr('ADD r1 r0 r2'))
ops.append(instr_ops.MachineInstr('STR r1 r3'))
ops.append(instr_ops.MachineInstr('ADD r2 r1 r3'))

print "Preparing for first pass"
instr_ops.debugInfo()

for o in ops:
    print "Analyzing op", o.text
    instr_ops.endRangeMaxUsingRegsModified(o.getRegsModified(), o)
    instr_ops.endRangeMaxUsingRegsUsed(o.getRegsUsed(), o)

    if o.isLoad():
        t = instr_ops.MemOpRecord(o)
        for r in o.getRegsUsed():
            print "Adding mod_dep", r
            instr_ops.RegDeps[r].mod_dep.append(t.handle)
        print "Adding use_dep", o.text.split()[1]
        instr_ops.RegDeps[o.text.split()[1]].use_dep.append(t.handle)
        instr_ops.AllMemOps.append(t)
        instr_ops.debugInfo()

    if o.isStore():
        t = instr_ops.MemOpRecord(o)
        for r in o.getRegsUsed():
            print "Adding mod_dep", r
            instr_ops.RegDeps[r].mod_dep.append(t.handle)
        instr_ops.RegDeps[o.text.split()[1]].use_dep.append(t.handle)
        instr_ops.AllMemOps.append(t)

    instr_ops.debugInfo()

print "Running check before second pass"

instr_ops.endRangeMaxUsingRegsModified(instr_ops.RegDeps.keys(), None)
instr_ops.endRangeMaxUsingRegsUsed(instr_ops.RegDeps.keys(), None)
instr_ops.debugInfo()

for o in reversed(ops):
    print "Analyzing op", o.text

    instr_ops.endRangeMinUsingRegsModified(o.getRegsModified(), o)
    instr_ops.endRangeMinUsingRegsUsed(o.getRegsUsed(), o)

    if o.isLoad():
        for a in instr_ops.AllMemOps:
            if a.opHandle == o.handle:
                t = a
                break
        for r in o.getRegsUsed():
            print "Adding mod_dep", r
            instr_ops.RegDeps[r].mod_dep.append(t.handle)
        print "Adding use_dep", o.text.split()[1]
        instr_ops.RegDeps[o.text.split()[1]].use_dep.append(t.handle)

    if o.isStore():
        for a in instr_ops.AllMemOps:
            if a.opHandle == o.handle:
                t = a
                break
        for r in o.getRegsUsed():
            print "Adding mod_dep", r
            instr_ops.RegDeps[r].mod_dep.append(t.handle)

print "Running final check"

instr_ops.endRangeMinUsingRegsModified(instr_ops.RegDeps.keys(), None)
instr_ops.endRangeMinUsingRegsUsed(instr_ops.RegDeps.keys(), None)
instr_ops.debugInfo()

# ==============================================================
# Dependency info done
# ==============================================================
class CanMoveInfo:
    """ Holds MachineInstr handle to insert after, and MachineInstr handles
    that can be inserted there """
    def __init__(self):
        self.insertAfter = None
        self.canMoveMIs = []

def memOpRecToMachineInstr(h):
    for o in ops:
        if o.handle == h.opHandle:
            return o
    raise RuntimeError

def isL(memOpRec):
    return memOpRecToMachineInstr(memOpRec).isLoad()

def splitBySameBase(main, verbose=False):
    if verbose: print "Starting splitBySameBase on", len(main), "memOpRecords"

    ret = []
    curBase = None
    mainCopy = []
    mainCopy.extend(main)
    for a in mainCopy:
        if curBase == None:
            curBase = memOpRecToMachineInstr(a).getBaseReg()
            if verbose: print " curBase now set to", curBase
        if curBase == memOpRecToMachineInstr(a).getBaseReg():
            if verbose: print " Appending"
            ret.append(a)
            main.remove(a)
        else:
            if verbose: print " Not appending"

    if verbose: print "End of splitBySameBase, returning", len(ret), "memory ops"
    return main, ret

def getMaxInfoUsingLoads(loadsWithSameBaseReg, verbose=False):
    if verbose: print "Starting getMaxInfoUsingLoads with", len(loadsWithSameBaseReg)
    curInfo = CanMoveInfo()
    maxInfo = CanMoveInfo()

    # Initially, see if we have RegDeps pointing to top of BB
    for l in loadsWithSameBaseReg:
        if l.rmin == None:
            curInfo.canMoveMIs.append(memOpRecToMachineInstr(l))

    # Set maxInfo to be whatever curInfo is after checking BB top
    maxInfo.canMoveMIs = []
    maxInfo.canMoveMIs.extend(curInfo.canMoveMIs)
    maxInfo.insertAfter = None

    if verbose: print " curInfo size:", len(curInfo.canMoveMIs)

    # Scan through ops, updating curInfo as we go
    for o in ops:
        curInfo.insertAfter = o
        # Check loads for beginning and end dependency markers
        for l in loadsWithSameBaseReg:
            if l.rmin == o.handle:
                curInfo.canMoveMIs.append(memOpRecToMachineInstr(l))
            if l.rmax == o.handle:
                curInfo.canMoveMIs.remove(memOpRecToMachineInstr(l))
        # If we have a new max, or if we can move down, update maxInfo
        if len(curInfo.canMoveMIs) >= len(maxInfo.canMoveMIs):
            maxInfo.canMoveMIs = []
            maxInfo.canMoveMIs.extend(curInfo.canMoveMIs)
            maxInfo.insertAfter = curInfo.insertAfter
        if verbose: print " curInfo size:", len(curInfo.canMoveMIs)
    if verbose: print "End of getMaxLoads with maxInfo having", len(maxInfo.canMoveMIs), "entries and points to", maxInfo.insertAfter.handle
    return maxInfo

def moveOpsUsingMaxInfo(ops, loads, maxInfo, verbose=False):
    retOps = []
    opsCopy = []
    opsCopy.extend(ops)
    # Copy everything up to and including insertAfter, skipping the loads
    for o in opsCopy:
        dontAppend = False
        # FIXME should be l in maxInfo.canMoveMIs, not loads!
        for l in loads:
            if memOpRecToMachineInstr(l).handle == o.handle:
                dontAppend = True
                if verbose: print o.handle, "skipped"
        if not dontAppend:
            retOps.append(o)
            if verbose: print o.handle, "appended"
        if o.handle == maxInfo.insertAfter.handle:
            break

    # Inserts load MachineInstrs and removes them from loads list
    loadsCopy = []
    loadsCopy.extend(loads)
    for lCopy in loadsCopy:
        if memOpRecToMachineInstr(lCopy) in maxInfo.canMoveMIs:
            retOps.append(memOpRecToMachineInstr(lCopy))
            # Remove the record just inserted (lCopy), not the stale loop variable l
            loads.remove(lCopy)
            if verbose: print memOpRecToMachineInstr(lCopy).handle, "inserted"

    # Append everything that followed insertAfter in the original order
    passed = False
    for o in ops:
        if passed:
            retOps.append(o)
            if verbose: print o.handle, "concatenated"
        # FIXME might not work with other cases - should append after loads
        if o.handle == maxInfo.insertAfter.handle:
            passed = True
    if verbose:
        print "After moving some loads"
        for o in retOps:
            print " ", o.handle, o.text
        print
    return retOps, loads

print "Before load reordering"
for o in ops:
    print o.handle, o.text
print
loads = [a for a in instr_ops.AllMemOps if isL(a)]
# Loop until we have no more loads
loads, loadsWithSameBaseReg = splitBySameBase(loads, verbose=False)
while len(loadsWithSameBaseReg) > 0:

    # Loop until we have no more loads with base reg to move
    while len(loadsWithSameBaseReg) > 0:
        maxInfo = getMaxInfoUsingLoads(loadsWithSameBaseReg, verbose=False)
        ops, loadsWithSameBaseReg = moveOpsUsingMaxInfo(ops, loadsWithSameBaseReg, maxInfo, verbose=False)

    loads, loadsWithSameBaseReg = splitBySameBase(loads, verbose=False)

print "After load reordering"
for o in ops:
    print o.handle, o.text
print
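
The scan in getMaxInfoUsingLoads amounts to finding the position at which the most ranges of movement overlap. A minimal Python 3 sketch of that step follows, under the assumption that each range is already available as an inclusive pair of indices; the prototype instead tracks rmin and rmax instruction handles, and `best_overlap` and the range names are hypothetical.

```python
def best_overlap(ranges):
    """Find the position covered by the most ranges.

    ranges: non-empty dict mapping a name to an inclusive (lo, hi) index pair.
    Returns (position, sorted names of the ranges live at that position).
    """
    best_pos, best = None, []
    last = max(hi for _, hi in ranges.values())
    for pos in range(last + 1):
        live = sorted(name for name, (lo, hi) in ranges.items() if lo <= pos <= hi)
        # ">=" lets ties go to the later position, mirroring the prototype's
        # "new max, or we can move down" rule
        if len(live) >= len(best):
            best_pos, best = pos, live
    return best_pos, best


cluster = best_overlap({"ldr_a": (0, 3), "ldr_b": (0, 2), "ldr_c": (5, 6)})
```

With these ranges the result is position 2 with `ldr_a` and `ldr_b`: both are live from position 0 onward, and the tie rule pushes the cluster point as far down as both ranges allow.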

Listing C.4: instr_ops.py

#!/usr/bin/python

import logging

NUMREGS = 4

class MachineInstr:
    """ Hold instr text, a handle, and identity methods """
    nextHandle = 0

    def __init__(self, instr):
        self.text = instr
        self.handle = MachineInstr.nextHandle
        MachineInstr.nextHandle += 1
        self.op = instr.split()[0]
        self.regsUsed = list(set(instr.split()[1:]))
        self.regsModified = []
        if not self.isStore():
            self.regsModified.append(instr.split()[1])

    def isLoad(self):
        return self.op == 'LDR'

    def isStore(self):
        return self.op == 'STR'

    def isMemOp(self):
        return self.isLoad() or self.isStore()

    def getRegsUsed(self):
        return self.regsUsed

    def getRegsModified(self):
        return self.regsModified

    def getBaseReg(self):
        # The base register is the last operand in the instruction text.
        # regsUsed is built from a set, so its ordering cannot be relied on.
        return self.text.split()[-1]

class Dependencies:
    """ Holds MemOpRecord handles for reg dependencies """

    def __init__(self):
        self.use_dep = []
        self.mod_dep = []

class MemOpRecord:
    """ Holds an instr handle and a MemOpRecord handle, along with ranges """
    nextHandle = 0

    def __init__(self, o):
        self.handle = MemOpRecord.nextHandle
        MemOpRecord.nextHandle += 1
        self.opHandle = o.handle
        self.rmax = None
        self.rmin = None

    def logme(self):
        logging.debug("MemOpRecord %s with op %s using MAX %s and MIN %s",
                      self.handle, self.opHandle, self.rmax, self.rmin)

RegDeps = {}
for i in range(0, NUMREGS):
    RegDeps["r" + str(i)] = Dependencies()

AllMemOps = []

def debugInfo():
    for a in AllMemOps:
        a.logme()
    for d in RegDeps:
        logging.debug(d)
        logging.debug(" U: %s", RegDeps[d].use_dep)
        logging.debug(" M: %s", RegDeps[d].mod_dep)
        logging.debug("")

def removeRegDepsUsingMemOp(d):
    """ Move through all RegDeps and remove MemOpRecord with handle d """
    for d2 in RegDeps:
        RegDeps[d2].use_dep = [v for v in RegDeps[d2].use_dep if v != d]
        RegDeps[d2].mod_dep = [v for v in RegDeps[d2].mod_dep if v != d]

def setRMaxUsingMemOpTo(d, finalInstr):
    """ Set all MemOpRecords with handle d to have RangeMax finalInstr (handle) """
    for a in AllMemOps:
        if a.handle == d:
            if finalInstr != None:
                a.rmax = finalInstr.handle
            else:
                a.rmax = None
            break

def setRMinUsingMemOpTo(d, finalInstr):
    """ Set all MemOpRecords with handle d to have RangeMin finalInstr (handle) """
    for a in AllMemOps:
        if a.handle == d:
            if finalInstr != None:
                a.rmin = finalInstr.handle
            else:
                a.rmin = None
            break

def endRangeMaxUsingRegsUsed(regs, finalInstr):
    """ Each used reg has deps checked and RangeMax set """
    for r in regs:
        for d in RegDeps[r].use_dep:
            logging.debug("Detected violated use_dep %s", d)
            setRMaxUsingMemOpTo(d, finalInstr)
            removeRegDepsUsingMemOp(d)

def endRangeMaxUsingRegsModified(regs, finalInstr):
    """ Each modified reg has deps checked and RangeMax set """
    for r in regs:
        for d in RegDeps[r].mod_dep:
            logging.debug("Detected violated mod_dep %s", d)
            setRMaxUsingMemOpTo(d, finalInstr)
            removeRegDepsUsingMemOp(d)

def endRangeMinUsingRegsUsed(regs, finalInstr):
    """ Each used reg has deps checked and RangeMin set """
    for r in regs:
        for d in RegDeps[r].use_dep:
            logging.debug("Detected violated use_dep %s", d)
            setRMinUsingMemOpTo(d, finalInstr)
            removeRegDepsUsingMemOp(d)

def endRangeMinUsingRegsModified(regs, finalInstr):
    """ Each modified reg has deps checked and RangeMin set """
    for r in regs:
        for d in RegDeps[r].mod_dep:
            logging.debug("Detected violated mod_dep %s", d)
            setRMinUsingMemOpTo(d, finalInstr)
            removeRegDepsUsingMemOp(d)
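
Because debugInfo and logme report through `logging.debug`, their output stays hidden unless the root logger is configured to show DEBUG messages. A minimal sketch of enabling it is shown below in Python 3 (the `force` argument assumes Python 3.8 or later; the logged values are made up for illustration). The listings themselves target Python 2, where `logging.basicConfig(level=logging.DEBUG)` without `force` should play the same role.

```python
import logging

# Assumption: this runs once, before any instr_ops function is called.
# force=True replaces any handlers that were installed earlier.
logging.basicConfig(level=logging.DEBUG, format="%(message)s", force=True)

# Same message shape as MemOpRecord.logme(); the values are illustrative only.
logging.debug("MemOpRecord %s with op %s using MAX %s and MIN %s", 0, 2, 4, None)

debug_enabled = logging.getLogger().isEnabledFor(logging.DEBUG)
```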