CS 4290/6290, ECE 4100/6100 Project 2: Tomasulo Algorithm Pipelined Processor


Category: You will Instantly receive a download link for .zip solution file upon Payment


5/5 - (2 votes)

Project Description In this project, you will complete the following: • Checkpoint 1: Construct a simulator for an out-of-order superscalar processor that uses the Tomasulo algorithm and fetches F instructions per cycle. 2 • Checkpoint 2: Enhance the simulator to maintain consistent state in the presence of exceptions with two separate schemes: 1. ROB with bypass and 2. Checkpoint repair. • Checkpoint 3: Use the simulator to find the appropriate number of function units, fetch rate and result buses for each benchmark. 1. Checkpoint 1 1.1. Directory Description The procsim_cpp.tar.gz package contains: 1. Makefile: compiles your code 2. procsim_driver.cpp: contains the main() method to run the simulator. Do not edit this file. 3. procsim.hpp: Used for defining structures and method declarations. You may edit this as you please. 4. procsim.cpp: Where all your methods should go. You may edit this as you please. 5. Traces folder: contains the traces to pass to the simulator (described below). Note: procsim_c.tar.gz contains equivalent files. 1.2. Command-Line Parameters Your project should include a Makefile, which builds binary in your project’s root directory named procsim. The program should run from this root directory as: ./procsim –r R –f F –j J –k K –l L < trace_file The command line parameters are as follows: • R – Result buses • F – Fetch rate (instructions per cycle) • J – Number of k0 function units • K – Number of k1 function units • L – Number of k2 function units • trace_file – Path name to the trace file 1.3. Input Trace Format The input traces will be given in the form: 3

… Where

is the address of the instruction (in hex) is either “0”, “1” or “2” , and are integers in the range [0..127] Note: If any reg # is -1, then there is no register for that part of the instruction (e.g., a branch instruction has -1 for its ) For example: ab120024 0 1 2 3 ab120028 1 4 1 3 ab12002c 2 -1 4 7 Means: “operation type 0” R1, R2, R3 “operation type 1” R4, R1, R3 “operation type 2” -, R4, R7 no destination register! As mentioned in 1.7, instructions of type -1 are executed in the type 1 function units. 1.4. Pipeline Stages Assume the following pipeline structure: Stage Name Number of Cycles per Instruction 1 Instruction Fetch/Decode 1 2 Dispatch Variable, depends on resource conflicts 3 Scheduling Variable, depends on data dependencies 4 Execute 1 4 5 State update Variable, depends on data dependencies (see notes below) 1.5. The Dispatch Queue • The fetch/decode rate is F instructions per cycle. Each cycle, the Fetch/Decode unit can always supply F instructions to the dispatch queue. • The dispatch queue length is unlimited, so there is always room for instructions in the dispatch queue. • The dispatch queue is scanned from head to tail (in program order). • When an instruction is inserted into the scheduling queue, it is deleted from the dispatch queue. • Tags are generated sequentially for every instruction, beginning with 0. The first instruction in a trace will use a tag value of 0, the second will use 1, etc. The traces are sufficiently short so that you should not need to reuse tags. 1.6. The Scheduling Queue • The size of the scheduling queue is 2*(number of k0 function units + number of k1 function units + number of k2 function units) • If there are multiple independent instructions ready to fire during the same cycle in the scheduling queue, service them in tag order (i.e., lowest tag value to highest). (This will guarantee that your results can match ours.) • A fired instruction remains in the scheduling queue until it completes. • The fire/issue rate is only limited by the number of available FUs. 1.7. The Function Units Function Unit Type Number of Units Latency 0 parameter: k0 1 1 parameter: k1 1 2 parameter: k2 1 • The number of function units is a parameter of the simulation and should be adjustable along the range of 1 to 3 units each. • Instructions of type -1 are executed in the type 1 function units. 5 1.8. Result Buses • There are R result buses (also called common data buses, or CDBs). This means that up to R instructions that complete in the same cycle may update the schedule queues and register file in the same cycle. • If an instruction wants to retire but all result buses are used, it must wait an additional cycle. The function unit is not freed in this case. Hence a subsequent instruction might stall because the function unit is busy. The function unit is freed only when the result is put onto a result bus. • The result buses do not have registers or latches. This means that broadcasts do not persist between clock cycles. The order of result bus broadcasts is as follows: • Within any given cycle, earlier tags get priority over later tags. • Earlier cycles with pending broadcasts get priority over later cycles. For example: Pending instruction tag … 7 6 8 5 12 … Pending since cycle … 5 5 5 6 6 … Where R = 2 Action: • Instructions with tags 6 and 7 will be broadcasted in parallel on the result buses in cycle N. • Instructions with tags 5 and 8 will be broadcasted in parallel on the result buses in cycle N+1. • Instruction with tag 12 will be broadcasted in parallel on the result buses in cycle N+2. 1.9. The State Update Unit • Assume that instructions retire in the same order that they complete. Instruction retirement is unconstrained (imprecise interrupts are possible (for checkpoint 1)). 1.10. Clock Propagation Note that the actual hardware has the following structure: • Instruction Fetch/Decode PIPELINE REGISTER 6 • Dispatch PIPELINE REGISTER • Scheduling PIPELINE REGISTER • Execute PIPELINE REGISTER • State update Instruction movement only happens when the latches are clocked, which occurs at the rising edge of each clock cycle. You must simulate the same behavior of the pipeline latches, even if you do not model the actual latches. For example, if an instruction is ready to move from scheduling to execute, the motion only takes effect at the beginning of the next clock cycle. Note that the latching is not strict between the dispatch unit and the scheduling unit, since both units use the scheduling queues. For example, if the dispatch unit inserts an instruction into one of the scheduling queues during clock cycle J, that instruction must spend at least cycle J+1 in the scheduling unit. Each stage of the pipeline can be divided into 2 cycle halves. Assume the following ordering of cycle halves (you do not need to explicitly model this, but please make sure your simulator follows this ordering of events): Cycle Portion Action 1 Mark completed instructions as completed/ retired 2 Broadcast results on the result bus and mark instruction as completed 3 The register file is written via a result bus 4 Any independent instruction in the scheduling queue is marked to fire and marked as issued 5 Scheduling queues are updated via a result bus 6 The dispatch unit reserves slots in the scheduling queues 7 The dispatch unit reads the register file 7 8 The state update unit deletes completed retired instructions from the scheduling queue 9 Fetch instructions 10 Program counter updated Note: Not all events are dependent on each other, and thus it is possible to have a different order of events and still achieve correct output. However, following this order, you should be guaranteed correctness. 1.11. Output For each trace, the output contains 2 files: 1. An output file, which contains: a. The processor settings b. A record of when each instruction was in each stage c. The processor statistics Correctness of your output is required for validation. Your simulator should output results to the terminal (stdout) and it should match the validated output on TSquare. See 1.11.2 for details. 2. A log file, which contains the cycle-by-cycle behavior of the machine. This file is not required for validation. This is simply there to help debug your code. See 1.11.1 for details. 1.11.1. Log File Details The log files are meant to help you debug and in particular to verify your algorithm if you’re working out a few instructions on paper first. You do not need to match the log file with your simulator. Log File Stage Represents the Cycle in Which FETCHED Fetch stage pushes instruction to dispatch queue. 8 DISPATCHED Dispatch stage pushes instruction to scheduler. SCHEDULED Schedule stage pushes the instruction to FU. EXECUTED The result of instruction has been computed and can be put onto the CDB if it is free. STATE UPDATE The instruction is deleted from the scheduling queue. 1.11.2. Output File Details The way to map FETCH, DISP, SCHED, EXEC and STATE in the outputs to messages in the logs is: Output File Stage Represents the Cycle in Which FETCH The instruction is fetched and put into the dispatch queue (same as FETCHED). DISP (The fetch stage push an instruction into dispatch queue) + 1 (FETCHED + 1). You can also think of it as “The first cycle that the instruction starts out in the dispatch queue”. SCHED (Dispatch stage pushes instruction to scheduler) + 1 (DISPATCHED + 1). You can also think of it as “The first cycle that the instruction starts out in the schedule queue”. EXEC (Schedule stage pushes instruction to FU) + 1 (SCHEDULED + 1). You can also think of it as “The first cycle that the instruction starts out in a FU”. STATE Instruction is deleted from the scheduling queue (same as STATE UPDATE). 9 1.11.3. Statistics The simulator outputs the following statistics after completion of the experimental run: 1. Total number of instructions in the trace 2. Average dispatch queue size 3. Maximum dispatch queue size 4. Average number of instructions fired per cycle 5. Average number of instructions retired per cycle (IPC) 6. Total run time (cycles) Your simulator should update the p_stats object to capture the statistics above. 1.12. Validation Your simulator must match the validation output that we will place on T-Square. 1.13. Submission Instructions for Checkpoint 1 Submit the following files in a single compressed folder: 1. Makefile 2. procsim.cpp 3. procsim.hpp 4. procsim_driver.cpp • Name the folder p2_c1_.tar.gz, for example, p2_c1_prabbat3.tar.gz. • Please compress the folder using the tar.gz compression format only. • When run, your simulator must write the output to the terminal (stdout) with identical formatting to the output files on T-Square. • However, do not include the traces, logs, and outputs in your submission. We will run your code and generate the output ourselves. 1.14. Grading for Checkpoint 1 Checkpoint 1 is worth 50 points: • 0% You do not hand in anything by the due date or you hand in someone else’s simulator • +50% Your simulator matches the checkpoint 1 validation output exactly. 10 1.15. Hints • Start early: this is a difficult project. • Work out by hand what should occur in each unit at each cycle for an example set of instructions (e.g., the first 10 instructions in one of the traces). Do this before you begin writing your program! • The algorithm in the notes is not intended to be implemented in C/C++/Java. Each “step” in the algorithm represents activity that occurs in parallel with other steps (due to the pipelining of the machine). Therefore, you will run into difficulty if you translate the algorithm from the notes to C/C++/Java directly. • Keep a counter for each line in the trace file. Use this for the Tomasulo tags. • Make each pipeline stage into a function (method). • Your simulator should continue running until all instructions have been retired. You may assume all instructions have retired when the number of instructions read equals the number of instructions retired. • Add a “debug” mode to your simulator that will print out what occurs during each simulated execution cycle. The debug mode should match the style of our example verification output. • Check T-Square and Piazza frequently for updates and notes. There will be a lot of clarifications required to get everyone matching our output. 1.16. Changelog All changes are marked in red. 3/13/16 1:00 PM [Paul]: • procsim.c[pp]: o Comment for setup_proc()said r = ROB size. Should be r = # result buses. • Project description: o 1.11, 1.13: Output results to terminal (stdout), not to result.output. o 1.3: Clarified -1 type instructions. 3/15/16 7:00 PM [Paul]: • Project description: o 1.8: Added note about result buses. o 1.10: Clarified ordering. 3/18/16 1:30 PM [Paul]: • Project description: o 1.7: There are 1 to 3 function units of a given type, not 1 to 2. o 1.8: Updated result bus priority scheme.