Master of Science in Electrical Engineering (MSEE)


Electrical and Computer Engineering

Document Type



Semiconductor feature sizes continue to decrease, permitting superscalar microprocessors to keep increasing the number of functional units available for execution. As the instruction issue width increases beyond the five-instruction average basic block size of integer programs, more than one basic block must be issued per cycle to continue increasing instructions per cycle (IPC). Researchers have created methods of fetching instructions beyond the first taken branch to overcome the bottleneck created by the limitations of conventional single branch predictors. We compare the performance of two such fetch unit optimization methods: multiple branch prediction (MBP) and the trace cache (TC). MBP fetch unit designs issue multiple basic blocks per cycle using a branch address cache and a multiple branch predictor. A trace cache uses the runtime instruction stream to create fixed-length instruction traces that encapsulate multiple basic blocks. The comparison is performed using a timing-based SPARC v8 simulator. We simulate both advanced fetch methods and execute benchmarks from the SPEC CPU2000 suite. The simulation results are compared, and a detailed analysis of both microarchitectures is performed. We find that both fetch unit designs provide competitive IPC performance. As the issue width is increased from an eight-way to a sixteen-way superscalar, IPC improves, implying that these fetch unit designs are able to take advantage of wider issue widths. The MBP can use a smaller L1 instruction cache than the TC and yet achieve similar IPC. Pre-arranged instructions provided by the TC allow the pipeline to be shortened in comparison to the MBP, and the shorter pipeline significantly improves IPC. Prior trace cache research used two or more ports to the instruction cache to improve the chances of fetching a full basic block per cycle, at the expense of instruction cache line realignment complexity. Our results show good performance with a single instruction cache port. We study an approximately equal-cost implementation of the MBP and the TC. Of the six benchmarks studied, the TC outperforms the MBP on four.



Document Availability at the Time of Submission

Release the entire work immediately for access worldwide.

Committee Chair

David Koppelman