# Introduction to microarchitecture
Microarchitecture defines the hardware implementation of a computer architecture, detailing the processor's functional components and control mechanisms.
## 1. Introduction to microarchitecture
Microarchitecture describes the hardware implementation of a given architecture. It focuses on the internal organization and design of a processor, specifically its datapath and control units [3](#page=3).
### 1.1 Processor components
A processor, as defined by its microarchitecture, consists of two primary parts:
* **Datapath:** This comprises the functional blocks responsible for executing operations, such as arithmetic logic units (ALUs), registers, and memory access units [3](#page=3).
* **Control:** This unit generates the necessary control signals to orchestrate the operations of the datapath, ensuring instructions are executed in the correct sequence and with the appropriate data flow [3](#page=3).
### 1.2 Implementation strategies
A single architecture can be implemented using various microarchitectural strategies, each with different performance characteristics. The primary strategies include:
* **Single-cycle processor:** In this approach, each instruction is designed to complete its execution within a single clock cycle. While conceptually simple, this often leads to longer clock cycle times as the cycle must be long enough to accommodate the slowest instruction [4](#page=4).
* **Multicycle processor:** This strategy breaks down the execution of each instruction into a series of shorter, sequential steps or cycles. This allows for a shorter clock cycle time compared to a single-cycle processor, as each step is simpler and takes less time. However, instructions still execute sequentially, one after another [4](#page=4).
* **Pipelined processor:** This is an advanced implementation that further breaks down instructions into a series of steps, similar to the multicycle approach. The key difference is that multiple instructions can be in different stages of execution simultaneously, overlapping their execution. This significantly increases instruction throughput, even though individual instruction latency might not decrease [4](#page=4).
> **Tip:** Understanding these different implementation strategies is crucial for appreciating the trade-offs between performance, complexity, and hardware cost in processor design.
---
# Single-cycle processor implementation
This section details the design of a single-cycle RISC-V processor, explaining its datapath and control logic to execute various instruction types, and analyzes its performance limitations [9](#page=9).
### 2.1 Overview of the single-cycle processor
The single-cycle processor aims to execute an instruction in a single clock cycle. This requires designing a datapath that can perform all necessary operations for any instruction within that cycle, and a control unit that generates the appropriate signals for each instruction. The datapath is composed of various hardware components such as the Program Counter (PC), Instruction Memory, Register File, ALU, and Data Memory [10](#page=10).
### 2.2 Example program execution
To understand the datapath's operation, an example program with `lw`, `sw`, `or`, and `beq` instructions is used [11](#page=11).
* **`lw x6, -4(x9)` (I-Type)**: This instruction loads a word from memory into register `x6`. The memory address is calculated by adding the immediate value `-4` to the content of register `x9` [11](#page=11) [12](#page=12).
* **Step 1: Fetch instruction:** The instruction at address `0x1000` is fetched from the Instruction Memory [13](#page=13).
* **Step 2: Read source operand (rs1):** The value from register `x9` (specified by `rs1` field) is read from the Register File [14](#page=14).
* **Step 3: Extend the immediate:** The 12-bit immediate value from the instruction is sign-extended to 32 bits. For `lw`, the immediate is `0xFFC` [15](#page=15).
* **Step 4: Compute the memory address:** The value of `rs1` is added to the sign-extended immediate using the ALU. The ALU operation is `add` (control signal `ALUControl = 000`). The result is `0x00002000` [16](#page=16).
* **Step 5: Read data from memory:** The computed address `0x00002000` is used to read data from the Data Memory. The data read is `0x10` (assuming this is the content at that address). This data will be written back to the Register File [17](#page=17).
* **Step 6: Determine the address of the next instruction:** The PC is incremented by 4 to point to the next instruction (`0x1004`) [18](#page=18).
* **Write back to Register File:** The data read from memory (`ReadData`) is written to the destination register `rd` (`x6`). This requires the `RegWrite` control signal to be asserted [17](#page=17).
* **`sw x6, 8(x9)` (S-Type)**: This instruction stores a word from register `x6` into memory. The memory address is calculated by adding the immediate value `8` to the content of register `x9` [11](#page=11) [20](#page=20).
* The immediate value is formed by concatenating bits `[31:25]` and `[11:7]` of the instruction [21](#page=21).
* The `MemWrite` control signal is asserted to enable writing to the Data Memory [20](#page=20).
* The value from `rs2` (register `x6`) is used as the `WriteData` [20](#page=20).
* **`or x4, x5, x6` (R-Type)**: This instruction performs a bitwise OR operation between the contents of registers `x5` (`rs1`) and `x6` (`rs2`), and stores the result in register `x4` (`rd`) [11](#page=11) [22](#page=22).
* For R-type instructions, both source operands (`rs1` and `rs2`) are read from the Register File [22](#page=22).
* The ALU performs the `or` operation (control signal `ALUControl = 011`) [22](#page=22) [29](#page=29).
* The `ALUSrc` control signal is set to `0`, indicating that the second ALU operand comes from the Register File (`rs2`) [22](#page=22).
* The `ResultSrc` control signal is set to `0` to select the `ALUResult` as the data to be written back to the `rd` register [22](#page=22).
* **`beq x4, x4, L7` (B-Type)**: This instruction branches to the label `L7` if the values in registers `x4` and `x4` are equal; because both source operands are the same register, this branch is always taken [11](#page=11) [23](#page=23).
* The ALU performs a subtraction between the two source operands (`rs1` and `rs2`). If the result is zero, the `Zero` flag is set [23](#page=23) [29](#page=29).
* The target address (`PCTarget`) is calculated by adding the current PC (the address of the branch instruction) to the sign-extended immediate value [23](#page=23).
* The `PCSrc` control signal determines whether the PC is updated with `PC+4` or the calculated `PCTarget`. If the `Zero` flag is asserted and the `Branch` control signal is active (for `beq`), `PCSrc` is `1`, selecting `PCTarget` as the next PC value [23](#page=23).
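The `lw` walk-through above can be traced in a few lines of straight-line Python. This is an illustrative sketch only: the register and memory contents (`x9 = 0x2004`, memory word `0x10` at address `0x2000`) are hypothetical values chosen to match the worked example.

```python
# Sketch of the single-cycle lw datapath steps from the example above.
# Register and memory contents are hypothetical, chosen so the numbers
# match the text (x9 = 0x2004, mem[0x2000] = 0x10, PC = 0x1000).

regs = {9: 0x2004, 6: 0}
data_mem = {0x2000: 0x10}
pc = 0x1000

def sign_extend(value, bits):
    """Sign-extend a `bits`-wide value to a Python int."""
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value

# Step 2: read rs1 (x9) from the register file
rs1_val = regs[9]
# Step 3: extend the 12-bit immediate 0xFFC (-4)
imm = sign_extend(0xFFC, 12)
# Step 4: ALU computes the memory address (ALUControl = add)
addr = rs1_val + imm
# Step 5: read the data memory at the computed address
read_data = data_mem[addr]
# Write back: ReadData -> rd (x6), requires RegWrite asserted
regs[6] = read_data
# Step 6: next PC
pc = pc + 4

print(hex(addr), hex(regs[6]), hex(pc))  # 0x2000 0x10 0x1004
```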
### 2.3 Immediate value extension
The single-cycle processor needs to handle different immediate formats for various instruction types (I, S, B, J). An `ImmSrc` control signal selects the appropriate immediate generation logic [21](#page=21) [24](#page=24).
* **I-Type:** Immediate is bits `[31:20]`, sign-extended [21](#page=21).
* **S-Type:** Immediate is formed by concatenating bits `[31:25]` and `[11:7]`, sign-extended [21](#page=21).
* **B-Type:** Immediate is formed from instruction bits `[31]`, `[7]`, `[30:25]`, and `[11:8]` (yielding imm[12], imm[11], imm[10:5], and imm[4:1]), sign-extended, with the least significant bit fixed at `0` [24](#page=24) [31](#page=31) [7](#page=7).
* **J-Type (for `jal`):** Immediate is formed from instruction bits `[31]`, `[19:12]`, `[20]`, and `[30:21]` (yielding imm[20], imm[19:12], imm[11], and imm[10:1]), sign-extended, with the least significant bit fixed at `0` [20](#page=20) [40](#page=40).
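The four immediate formats can be modeled as a single Python function. This is an illustrative sketch, not the book's hardware description; the `ImmSrc` encoding (`00`=I, `01`=S, `10`=B, `11`=J) follows the control-signal tables in this chapter.

```python
def extend_imm(instr, imm_src):
    """Extract and sign-extend the immediate from a 32-bit RISC-V
    instruction word, selected by ImmSrc (00=I, 01=S, 10=B, 11=J)."""
    def bits(hi, lo):
        return (instr >> lo) & ((1 << (hi - lo + 1)) - 1)
    sign = -(instr >> 31)            # 0 or -1: replicates bit 31 upward
    if imm_src == 0b00:              # I-type: instr[31:20]
        return bits(30, 20) | (sign << 11)
    if imm_src == 0b01:              # S-type: instr[31:25] ++ instr[11:7]
        return (bits(30, 25) << 5) | bits(11, 7) | (sign << 11)
    if imm_src == 0b10:              # B-type: {instr[31], instr[7],
                                     #  instr[30:25], instr[11:8], 0}
        return (bits(7, 7) << 11) | (bits(30, 25) << 5) \
             | (bits(11, 8) << 1) | (sign << 12)
    # J-type: {instr[31], instr[19:12], instr[20], instr[30:21], 0}
    return (bits(19, 12) << 12) | (bits(20, 20) << 11) \
         | (bits(30, 21) << 1) | (sign << 20)

# lw x6, -4(x9) encodes as 0xFFC4A303; its I-type immediate is -4.
print(extend_imm(0xFFC4A303, 0b00))  # -4
```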
### 2.4 Control Unit
The control unit decodes the instruction's `op` (opcode) and other relevant fields (like `funct3`, `funct7`) to generate control signals for the datapath [27](#page=27).
* **Main Decoder:** Decodes the `op` field to determine the instruction type and generate initial control signals like `RegWrite`, `ImmSrc`, `ALUSrc`, `MemWrite`, `ResultSrc`, `Branch`, and `ALUOp` [28](#page=28).
* `lw` (op=3): `RegWrite=1`, `ImmSrc=00`, `ALUSrc=1`, `MemWrite=0`, `ResultSrc=1`, `Branch=0`, `ALUOp=00` [28](#page=28).
* `sw` (op=35): `RegWrite=0`, `ImmSrc=01`, `ALUSrc=1`, `MemWrite=1`, `ResultSrc=X`, `Branch=0`, `ALUOp=00` [28](#page=28).
* R-type (op=51): `RegWrite=1`, `ImmSrc=XX`, `ALUSrc=0`, `MemWrite=0`, `ResultSrc=0`, `Branch=0`, `ALUOp=10` [28](#page=28).
* `beq` (op=99): `RegWrite=0`, `ImmSrc=10`, `ALUSrc=0`, `MemWrite=0`, `ResultSrc=X`, `Branch=1`, `ALUOp=01` [28](#page=28).
* I-Type ALU instructions (e.g., `addi`, `andi`, `ori`, `slti`) have `op=19`. They are similar to R-type but use the immediate as the second source operand. `RegWrite=1`, `ImmSrc=00`, `ALUSrc=1`, `MemWrite=0`, `ResultSrc=0`, `Branch=0`, `ALUOp=10` [35](#page=35) [36](#page=36).
* `jal` (op=111): `RegWrite=1`, `ImmSrc=11`, `ALUSrc=X`, `MemWrite=0`, `ResultSrc=10`, `Branch=0`, `ALUOp=XX`, `Jump=1` [41](#page=41).
* **ALU Decoder:** Based on `ALUOp` (from the main decoder), `funct3`, and `funct7` (for R-type), this decoder generates the 3-bit `ALUControl` signal to select the specific ALU operation [31](#page=31) [32](#page=32).
* `add`: `ALUControl = 000` [29](#page=29).
* `subtract`: `ALUControl = 001` [29](#page=29).
* `and`: `ALUControl = 010` [29](#page=29).
* `or`: `ALUControl = 011` [29](#page=29).
* `slt` (set less than): `ALUControl = 101` [29](#page=29).
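The two decoder tables above can be captured as small Python lookup tables. This is a sketch for illustration; the tuple layout and the `'x'` marker for don't-cares are conventions chosen here, and the `jal` row (which widens `ResultSrc` and adds a `Jump` signal) is omitted for brevity.

```python
MAIN_DECODER = {
    # op: (RegWrite, ImmSrc, ALUSrc, MemWrite, ResultSrc, Branch, ALUOp)
    3:   (1, '00', 1, 0, '1', 0, '00'),  # lw
    35:  (0, '01', 1, 1, 'x', 0, '00'),  # sw
    51:  (1, 'xx', 0, 0, '0', 0, '10'),  # R-type
    99:  (0, '10', 0, 0, 'x', 1, '01'),  # beq
    19:  (1, '00', 1, 0, '0', 0, '10'),  # I-type ALU
}

ALU_CONTROL = {'add': 0b000, 'sub': 0b001, 'and': 0b010,
               'or': 0b011, 'slt': 0b101}

def alu_decode(alu_op, funct3, funct7b5=0):
    """ALU decoder: ALUOp '00' -> add (lw/sw address), '01' -> sub (beq),
    '10' -> inspect funct3 (and funct7 bit 5 to split add from sub;
    simplified here, the full decoder also consults op[5])."""
    if alu_op == '00':
        return ALU_CONTROL['add']
    if alu_op == '01':
        return ALU_CONTROL['sub']
    return {0b000: ALU_CONTROL['sub'] if funct7b5 else ALU_CONTROL['add'],
            0b010: ALU_CONTROL['slt'],
            0b110: ALU_CONTROL['or'],
            0b111: ALU_CONTROL['and']}[funct3]
```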
### 2.5 Extending the processor
The single-cycle processor can be extended to handle more instruction types [34](#page=34).
* **I-Type ALU instructions (`addi`, `andi`, `ori`, `slti`):** These instructions are similar to R-type but use the immediate value as the second source operand for the ALU. This requires asserting `ALUSrc` and selecting the appropriate immediate using `ImmSrc` [35](#page=35) [36](#page=36) [37](#page=37).
* **`jal` (Jump and Link):** This instruction is similar to `beq` in that it uses an immediate for calculating a target address and always updates the PC. However, the jump is always taken. It also stores the `PC+4` address in the destination register `rd`. This requires a new `ImmSrc` value and `ResultSrc` to select `PC+4` [11](#page=11) [38](#page=38) [39](#page=39) [41](#page=41) [42](#page=42).
### 2.6 Performance limitations
The primary performance limitation of a single-cycle processor is that the clock cycle time (`Tc`) is determined by the longest delay path, known as the critical path, which is usually associated with the `lw` instruction [43](#page=43) [45](#page=45).
* **Program Execution Time:** This is calculated as:
$$ \text{Execution Time} = (\#\text{instructions}) \times \text{CPI} \times T_C $$
where CPI is the number of cycles per instruction and $T_C$ is the clock cycle time (seconds per cycle). In a single-cycle processor, CPI is always 1 [44](#page=44).
* **Critical Path Calculation:** The critical path for a `lw` instruction includes:
$$ T_{C\_\text{single}} = t_{\text{pcq\_PC}} + t_{\text{mem}} + \max[t_{\text{RFread}}, t_{\text{dec}} + t_{\text{ext}} + t_{\text{mux}}] + t_{\text{ALU}} + t_{\text{mem}} + t_{\text{mux}} + t_{\text{RFsetup}} $$
This can be simplified by considering the dominant delays:
$$ T_{C\_\text{single}} = t_{\text{pcq\_PC}} + 2t_{\text{mem}}+ t_{\text{RFread}} + t_{\text{ALU}} + t_{\text{mux}} + t_{\text{RFsetup}} $$
* **Example Calculation:** Using typical delay values:
* $t_{\text{pcq\_PC}} = 40 \text{ ps}$
* $t_{\text{mem}} = 200 \text{ ps}$
* $t_{\text{RFread}} = 100 \text{ ps}$
* $t_{\text{ALU}} = 120 \text{ ps}$
* $t_{\text{mux}} = 30 \text{ ps}$
* $t_{\text{RFsetup}} = 60 \text{ ps}$
$$ T_{C\_\text{single}} = (40 + 2 \times 200 + 100 + 120 + 30 + 60) \text{ ps} = 750 \text{ ps} $$
* **Implication:** If a program has 100 billion instructions, the total execution time would be:
$$ \text{Execution Time} = (100 \times 10^9) \times 1 \times (750 \times 10^{-12} \text{ s}) = 75 \text{ seconds} $$ [1](#page=1).
This highlights that all instructions, regardless of their complexity (e.g., a simple R-type instruction that takes much less time), must wait for the slowest instruction (`lw`) to complete within a single clock cycle. This inefficiency motivates the development of multi-cycle or pipelined processors [48](#page=48).
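The critical-path arithmetic above can be checked directly; this snippet simply reproduces the numbers from the example calculation.

```python
# Reproducing the single-cycle critical-path arithmetic from the text.
t = {'pcq': 40, 'mem': 200, 'rf_read': 100,
     'alu': 120, 'mux': 30, 'rf_setup': 60}    # delays in picoseconds

tc_single = (t['pcq'] + 2 * t['mem'] + t['rf_read']
             + t['alu'] + t['mux'] + t['rf_setup'])

print(tc_single)  # 750 (ps)

# CPI = 1: every instruction takes exactly one (long) cycle.
exec_time = 100e9 * 1 * tc_single * 1e-12
print(exec_time)  # 75.0 (seconds)
```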
---
# Multicycle processor implementation
Multicycle processor designs break down instruction execution into a series of smaller steps, allowing for hardware reuse and a faster clock cycle compared to single-cycle processors [50](#page=50).
### 3.1 Overview of multicycle processors
The multicycle processor approach aims to improve performance by dividing instruction execution into multiple clock cycles, where each cycle performs a specific micro-operation. This contrasts with single-cycle processors, where each instruction completes in a single, albeit potentially long, clock cycle [50](#page=50).
Key advantages of multicycle processors include:
* **Higher clock speed:** The cycle time is determined by the longest stage within an instruction's execution, not the entire instruction's longest possible path [51](#page=51).
* **Faster execution for simpler instructions:** Instructions that require fewer steps complete in less time [51](#page=51).
* **Hardware reuse:** Expensive functional units, such as memory, ALU, and register file, can be shared across different clock cycles for different operations, leading to a potentially simpler and more cost-effective design [51](#page=51).
However, there is a sequencing overhead that is paid for each instruction, regardless of its complexity. The design process for a multicycle processor follows similar steps to a single-cycle processor: first, design the datapath, and then design the control unit [51](#page=51).
### 3.2 Multicycle datapath components
The multicycle processor can utilize a single, unified memory for both instructions and data, which is more realistic than separate memories used in some single-cycle designs. Key state elements include the Program Counter (PC), Instruction Register (IR), Register File, and a unified Instruction/Data Memory [52](#page=52).
The execution of an instruction is broken down into stages, illustrated here with the `lw` (load word) instruction:
* **Step 1: Fetch instruction**
The PC provides the address to fetch the instruction from memory. The fetched instruction is loaded into the Instruction Register (IR). The PC is also updated to PC+4, preparing for the next instruction fetch [53](#page=53).
* **Step 2: Read source operand(s) and extend immediate**
For instructions like `lw`, the immediate value is extended. The required source register(s) (e.g., `Rs1`) are read from the Register File [54](#page=54).
* **Step 3: Compute memory address**
The memory address for `lw` is calculated by adding the base register value (`Rs1`) to the sign-extended immediate value. This calculation is performed by the ALU [55](#page=55).
* **Step 4: Read data from memory**
The calculated address is used to read data from the Instruction/Data Memory [56](#page=56).
* **Step 5: Write data back to register file**
The data read from memory is written back to the destination register (`Rd`) in the Register File [57](#page=57).
* **Step 6: Increment PC**
The PC is incremented to PC+4. This step is fundamental to all instruction types for fetching the next instruction sequentially [58](#page=58).
#### 3.2.1 Datapath for other instructions
The datapath is designed to accommodate various instruction types by controlling the flow of data and the operation of functional units:
* **`sw` (store word):** This instruction requires calculating the memory address, reading the data from a register (`Rs2`), and then writing that data to memory at the computed address. It does not write back to the register file [60](#page=60).
* **`beq` (branch if equal):** To handle branches, the processor needs to compute the branch target address (BTA) which is `PC + imm`. It also needs to compare two source registers (`Rs1` and `Rs2`) using the ALU. The `Zero` output of the ALU is crucial here; if it's high, the PC is updated to the BTA. The current PC value needs to be preserved as `OldPC` for calculating the target address if the branch is taken [61](#page=61).
* **R-Type instructions:** These instructions involve reading two source registers, performing an ALU operation (e.g., addition, subtraction), and writing the result back to a destination register. The ALU operation is determined by `funct3` and `funct7` fields [80](#page=80) [81](#page=81).
* **I-Type ALU instructions (e.g., `addi`):** These instructions read a source register, read an immediate value, perform an ALU operation, and write the result back to a destination register [90](#page=90).
* **`jal` (jump and link):** This instruction writes the return address (`PC + 4`) into the destination register (`rd`) and updates the PC to the jump target address [92](#page=92) [93](#page=93).
### 3.3 Multicycle control unit
The control unit is responsible for generating the control signals that orchestrate the datapath operations across the multiple clock cycles. It consists of an Instruction Decoder, an ALU Decoder, and a Main Finite State Machine (FSM) [64](#page=64).
* **Instruction Decoder:** This component decodes the `op` field of the instruction to determine the instruction type and generate signals like `ImmSrc` to control immediate value extension [65](#page=65).
* **ALU Decoder:** This unit, similar to the single-cycle design, decodes the `op` and `funct3`/`funct7` fields to generate control signals for the ALU (`ALUControl`) [64](#page=64).
* **Main FSM:** This is the core of the control unit. It sequences through states to execute an instruction, generating signals such as `ALUSrcA`, `ALUSrcB`, `ALUOp`, `ResultSrc`, `RegWrite`, `MemWrite`, `IRWrite`, `PCWrite`, `AdrSrc`, and `Branch` [66](#page=66).
#### 3.3.1 Finite State Machine (FSM) for instruction execution
The Main FSM manages the states for executing different instruction types. Each state is associated with specific control signal values for a particular clock cycle.
* **Fetch (S0):** Reads the instruction from memory, asserts `IRWrite`, and sets `AdrSrc` to 0 so the PC supplies the memory address. Because the ALU is otherwise idle in this state, it is used to compute PC+4, and `PCUpdate` is asserted to write that value into the PC [67](#page=67) [74](#page=74) [75](#page=75).
* **Decode (S1):** Reads source registers (`Rs1`, `Rs2`) from the register file. For branches and jumps, it also calculates the target address (`PC + imm`) using the ALU [68](#page=68) [84](#page=84) [85](#page=85).
* **Memory Address (S2):** Computes the memory address for load and store operations by adding the base register value to the sign-extended immediate [69](#page=69).
* **Memory Read (S3):** Reads data from memory using the computed address for `lw` instructions [70](#page=70) [71](#page=71).
* **Memory Write Back (S4):** Writes the data read from memory back to the register file for `lw` instructions [72](#page=72) [73](#page=73).
* **Memory Write (S5):** Writes data from `Rs2` to memory for `sw` instructions [77](#page=77) [78](#page=78).
* **R-Type Execute (S6):** Performs the ALU operation specified by the instruction on source registers [79](#page=79) [80](#page=80).
* **ALU Write Back (S7):** Writes the result of an R-type ALU operation back to the destination register [81](#page=81) [82](#page=82).
* **I-Type ALU Execute (S8):** Performs the ALU operation for I-type instructions, involving a register and an immediate value [89](#page=89) [90](#page=90).
* **JAL (S9):** Calculates `PC + 4`, writes this value to `rd`, and updates the PC to the jump target address [91](#page=91) [92](#page=92).
* **BEQ (S10):** Compares `Rs1` and `Rs2`. If they are equal (ALU `Zero` output is 1), it updates the PC to the branch target address; otherwise, it proceeds to the next sequential instruction [86](#page=86) [87](#page=87).
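The state sequencing above can be checked with a compact next-state function in Python (state names abbreviated here; this is an illustrative sketch, not the book's state encoding). Walking each opcode through it until control returns to Fetch reproduces the per-instruction cycle counts given in Section 3.4.

```python
# Minimal next-state function for the Main FSM described above.
# Opcodes follow the main-decoder table: 3=lw, 35=sw, 51=R-type,
# 19=I-type ALU, 111=jal, 99=beq.

def next_state(state, op):
    if state == 'Fetch':
        return 'Decode'
    if state == 'Decode':
        return {3: 'MemAdr', 35: 'MemAdr', 51: 'ExecuteR',
                19: 'ExecuteI', 111: 'JAL', 99: 'BEQ'}[op]
    if state == 'MemAdr':
        return 'MemRead' if op == 3 else 'MemWrite'
    if state == 'MemRead':
        return 'MemWB'
    if state in ('ExecuteR', 'ExecuteI', 'JAL'):
        return 'ALUWB'
    # MemWB, MemWrite, ALUWB, and BEQ all return to Fetch
    return 'Fetch'

def cycles(op):
    """Count cycles from Fetch back around to Fetch for one opcode."""
    state, count = 'Fetch', 0
    while True:
        count += 1
        state = next_state(state, op)
        if state == 'Fetch':
            return count

print(cycles(3), cycles(99), cycles(51))  # 5 3 4
```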
> **Tip:** The FSM simplifies control signal management by defaulting most signals to 0 or "don't care" if not explicitly listed for a state, reducing the complexity of the control logic [66](#page=66).
### 3.4 Performance considerations
Multicycle processors exhibit variable execution times for different instructions because they take a varying number of cycles to complete [96](#page=96).
* **Instruction cycle counts:**
* `beq`: 3 cycles [96](#page=96).
* R-type, `addi`, `sw`, `jal`: 4 cycles [96](#page=96).
* `lw`: 5 cycles [96](#page=96).
* **Cycles Per Instruction (CPI):** The overall CPI is a weighted average based on the frequency of each instruction type in a program. For example, using the SPECINT2000 benchmark [96](#page=96):
Average CPI = (0.13 * 3) + ((0.52 + 0.10) * 4) + (0.25 * 5) = 4.12 cycles/instruction [96](#page=96).
* **Critical Path:** The critical path in a multicycle processor determines the clock cycle time. It is typically bounded by the longest delay among the stages. A common formula for the multicycle clock cycle time ($T_{c\_multi}$) is:
$$T_{c\_multi} = t_{pcq} + t_{dec} + 2 \times t_{mux} + \max(t_{ALU}, t_{mem}) + t_{setup}$$
where:
* $t_{pcq}$: PC clock-to-Q delay [98](#page=98).
* $t_{dec}$: Decoder delay (control unit) [98](#page=98).
* $t_{mux}$: Multiplexer delay [98](#page=98).
* $t_{ALU}$: ALU delay [98](#page=98).
* $t_{mem}$: Memory read delay [98](#page=98).
* $t_{setup}$: Register setup time [98](#page=98).
A sample calculation based on provided delays yields $T_{c\_multi} = 375$ ps [99](#page=99).
* **Performance Example:** For a program with 100 billion instructions, a CPI of 4.12, and a clock cycle time of 375 ps, the total execution time would be:
Execution Time = (100 × $10^9$) × 4.12 × (375 × $10^{-12}$) seconds = 155 seconds. This can be slower than a single-cycle processor if the single-cycle processor's clock speed is significantly higher [100](#page=100).
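The CPI and execution-time arithmetic above can be reproduced directly:

```python
# Reproducing the SPECINT2000 CPI and execution-time numbers above.
cycle_counts = {'beq': 3, 'four_cycle': 4, 'lw': 5}
mix = {'beq': 0.13, 'four_cycle': 0.62, 'lw': 0.25}  # R-type+addi+sw+jal = 62%

avg_cpi = sum(mix[k] * cycle_counts[k] for k in mix)   # ≈ 4.12

tc_multi = 375e-12                       # clock cycle time in seconds
exec_time = 100e9 * avg_cpi * tc_multi   # ≈ 154.5 s, i.e. about 155 seconds
```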
> **Tip:** While multicycle processors offer hardware savings and better clock speeds than a naive single-cycle design, their overall performance is dictated by the CPI and clock cycle time. Optimizing the critical path and reducing the number of cycles for frequent instructions are key to improving performance.
---
# Pipelined processor implementation and hazards
This section details the implementation of a pipelined processor and the various hazards that can arise, along with strategies for their resolution.
### 4.1 Pipelined processor implementation
Pipelining is a technique used to improve the throughput of a processor by overlapping the execution of multiple instructions. Instead of executing one instruction completely before starting the next (as in a single-cycle processor), pipelining divides instruction execution into several distinct stages, which are executed concurrently for different instructions.
#### 4.1.1 The 5-stage pipeline
The common 5-stage pipeline architecture consists of the following stages:
1. **Fetch (IF):** Fetches the instruction from memory based on the Program Counter (PC).
2. **Decode (ID):** Decodes the instruction and reads the required operands from the register file.
3. **Execute (EX):** Performs the arithmetic or logical operation using the ALU.
4. **Memory (MEM):** Accesses data memory for loads or stores.
5. **Writeback (WB):** Writes the result back to the register file.
Pipeline registers are introduced between these stages to hold the intermediate results and control signals for each instruction as it progresses through the pipeline. This allows different instructions to be in different stages simultaneously.
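The overlap can be illustrated with a tiny Python helper that reports which stage an instruction occupies in a given cycle of an ideal, hazard-free pipeline (one instruction issued per cycle; a sketch, not a simulator):

```python
# Which stage does instruction i occupy in clock cycle c of an ideal
# 5-stage pipeline? Instruction i enters Fetch in cycle i + 1.
STAGES = ['IF', 'ID', 'EX', 'MEM', 'WB']

def stage_of(instr_index, cycle):
    """Stage occupied by instruction `instr_index` (0-based) during
    clock cycle `cycle` (1-based), or None if it is not in flight."""
    pos = cycle - 1 - instr_index
    return STAGES[pos] if 0 <= pos < len(STAGES) else None

# In cycle 2, instruction 0 is in Decode while instruction 1 is in Fetch.
print(stage_of(0, 2), stage_of(1, 2))  # ID IF
```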
#### 4.1.2 Single-cycle vs. Pipelined execution
In a single-cycle processor, each instruction takes one clock cycle to complete, and the clock cycle must be long enough to accommodate the slowest instruction (the critical path). This leads to underutilization of hardware, as simpler instructions take the same amount of time as complex ones.
Pipelining, on the other hand, breaks down instruction execution into smaller stages, each taking one clock cycle. The clock cycle time is determined by the longest stage (the pipelined critical path). This allows for higher throughput, as ideally one instruction completes every clock cycle, even though individual instruction latency might increase slightly due to pipeline register overhead.
* **Single-Cycle Execution:** An instruction occupies the entire datapath for the full clock cycle.
* **Pipelined Execution:** Different instructions occupy different stages in each clock cycle. For example, in cycle 2, the first instruction might be in the Decode stage while the second instruction is in the Fetch stage.
#### 4.1.3 Pipelined datapath
The datapath of a pipelined processor includes pipeline registers between each stage to store stage-specific information. Control signals are also propagated through the pipeline, with each instruction "dropping off" its control signals when they are no longer needed. Signals are often appended with the first letter of the stage they belong to (e.g., `PCF` for PC in Fetch, `PCD` for PC in Decode).
The corrected pipelined datapath resolves the register-file read/write conflict by writing the register file on the falling edge of the clock, midway through the cycle, so a value written back in the first half of a cycle can be read by a later instruction in the second half.
#### 4.1.4 Control signals in pipelined processors
The control unit for a pipelined processor is similar to that of a single-cycle processor, but its control signals are passed through the pipeline registers. These signals, such as `RegWrite`, `MemWrite`, `ALUSrc`, `ALUControl`, and `ResultSrc`, travel with the instruction and are asserted in the appropriate stages.
### 4.2 Pipelined processor hazards
Hazards are conditions that prevent the next instruction in the instruction stream from executing during its normal clock cycle. They can cause a pipeline to deviate from its ideal behavior of completing one instruction per cycle.
#### 4.2.1 Data hazards
Data hazards occur when an instruction depends on the result of a previous instruction that has not yet completed and written its result back to the register file. There are three main types:
1. **RAW (Read After Write):** An instruction tries to read a register before a previous instruction has written to it. This is the most common type.
2. **WAR (Write After Read):** An instruction tries to write to a register that a previous instruction has already read. This is less common in simple pipelines due to the sequential nature of WB.
3. **WAW (Write After Write):** Two instructions try to write to the same register. The pipeline must ensure the writes occur in the correct program order.
**Example:** Consider an `add` instruction followed by a `sub` instruction. If the `sub` needs the result of the `add`, and the `add` has not yet written its result to the register file by the time `sub` reads its operands, a data hazard occurs.
#### 4.2.2 Handling data hazards
Several techniques can be employed to handle data hazards:
* **Compile-time techniques:**
* **Inserting NOPs (No-Operation instructions):** The compiler inserts dummy instructions to create delays, allowing the required data to become available.
* **Code reordering:** The compiler rearranges independent instructions to execute before the dependent instruction, filling the pipeline stalls.
* **Run-time techniques:**
* **Data Forwarding (or Bypassing):** This is a hardware-based solution where the result of an instruction is forwarded from the output of an earlier stage (e.g., Execute or Memory) directly to the input of a later stage (e.g., Execute) before it is written back to the register file. This significantly reduces stalls.
* The hazard unit checks whether the source registers (`Rs1E`, `Rs2E`) of the instruction in the Execute stage match the destination registers (`RdM`, `RdW`) of instructions in the Memory or Writeback stages.
* If a match is found and the previous instruction is writing to the register (`RegWriteM` or `RegWriteW` is asserted), the data is forwarded.
* **Forwarding logic:**
* **Case 1 (Forward from Memory):** If `Rs1E` (or `Rs2E`) matches `RdM` and `RegWriteM` is asserted, forward from the Memory stage. `ForwardAE` (or `ForwardBE`) is set to `10`.
* **Case 2 (Forward from Writeback):** If `Rs1E` (or `Rs2E`) matches `RdW` and `RegWriteW` is asserted, forward from the Writeback stage. `ForwardAE` (or `ForwardBE`) is set to `01`.
* **Case 3 (No Forwarding):** Otherwise, read from the register file. `ForwardAE` (or `ForwardBE`) is set to `00`.
* **Exclusion of zero register:** The forwarding logic includes the check `(Rs1E != 0)` or `(Rs2E != 0)` so that no forwarding occurs when the source is `x0`, which is hardwired to zero.
* **Stalling (or Pipeline Bubbles):** If forwarding is not possible (e.g., for a load instruction whose data is not available until the Memory stage but is needed in the Execute stage of the immediately following instruction), the pipeline must stall.
* **Load-Use Hazard:** A common scenario is when an instruction immediately follows a `lw` instruction and uses the loaded data. The data from memory is only available at the end of the Memory stage, but it is needed in the Execute stage of the next instruction. This requires a stall.
* **Stalling Logic:** A stall is triggered if a source register in the Decode stage (`Rs1D` or `Rs2D`) matches the destination register in the Execute stage (`RdE`), and the instruction in the Execute stage is a `lw` (indicated by `ResultSrcE0` being `1`, i.e., the result comes from memory).
* The logic `lwStall = ((Rs1D == RdE) OR (Rs2D == RdE)) AND ResultSrcE0` determines if a stall is necessary.
* When a stall occurs, the Fetch and Decode stages are frozen (`StallF = StallD = lwStall`), and the instruction in the Execute stage is flushed, i.e., turned into a bubble (`FlushE = lwStall`).
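The forwarding and stall equations above can be sketched as plain Python predicates; the signal names (`Rs1E`, `RdM`, `ResultSrcE0`, and so on) follow the text, and the return encodings match the three forwarding cases.

```python
# Sketch of the hazard-unit equations described above.

def forward(rs_e, rd_m, rd_w, reg_write_m, reg_write_w):
    """Forwarding select for one ALU operand (ForwardAE / ForwardBE).
    The Memory stage has priority because it holds the newer result."""
    if reg_write_m and rs_e == rd_m and rs_e != 0:
        return 0b10   # forward ALU result from the Memory stage
    if reg_write_w and rs_e == rd_w and rs_e != 0:
        return 0b01   # forward result from the Writeback stage
    return 0b00       # no hazard: use the register-file value

def lw_stall(rs1_d, rs2_d, rd_e, result_src_e0):
    """Load-use stall: the instruction in Execute is a lw
    (ResultSrcE0 = 1) and Decode reads its destination register."""
    return bool(result_src_e0) and (rs1_d == rd_e or rs2_d == rd_e)
```

For example, `forward(5, 5, 2, 1, 1)` returns `0b10` because the Memory-stage result takes priority over the older Writeback-stage value.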
#### 4.2.3 Control hazards
Control hazards arise from branches and jumps, where the sequence of instruction execution changes based on a condition. The processor fetches instructions sequentially, but a branch instruction might cause it to jump to a different address. The decision about whether to branch is often made late in the pipeline (e.g., in the Execute stage), meaning instructions speculatively fetched after the branch might need to be discarded.
**Example:** A `beq` instruction whose condition is evaluated in the Execute stage. If the branch is taken, the subsequent instructions fetched into the pipeline must be flushed.
* **Branch Misprediction Penalty:** The number of instructions flushed when a branch is taken is called the branch misprediction penalty. In a 5-stage pipeline with the branch decision made in the Execute stage, typically 2 instructions (those in the Fetch and Decode stages) are flushed.
#### 4.2.4 Handling control hazards
Methods to mitigate control hazards include:
* **Branch Prediction:** The processor speculatively predicts whether a branch will be taken or not taken. If the prediction is correct, no stall occurs. If incorrect, the pipeline must be flushed. Sophisticated branch predictors exist (e.g., static prediction, dynamic prediction).
* **Delayed Branch:** The instruction immediately following the branch is executed regardless of the branch outcome. This instruction is placed in the "branch delay slot" and must be useful to the program.
* **Flushing:** If a branch is taken (or mispredicted), the instructions that have been fetched but not yet retired must be discarded or "flushed" from the pipeline.
* **Flushing Logic:** If a branch is taken in the Execute stage (`PCSrcE` is asserted), the instructions in the Fetch and Decode stages need to be flushed. This is achieved by clearing the Decode and Execute pipeline registers using the `FlushD` and `FlushE` signals.
* The flushing logic is: `FlushD = PCSrcE` and `FlushE = lwStall OR PCSrcE`. `FlushE` is active whenever a taken branch (`PCSrcE`) or a load-use stall (`lwStall`) requires clearing the Execute stage; `FlushD` is active only when a branch is taken.
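The flush equations quoted above transcribe directly into a two-line function (signal names follow the text):

```python
# Flush signals: a taken branch (PCSrcE) squashes the instructions in
# Decode and Execute; a load-use stall bubbles only the Execute stage.

def flush_signals(pc_src_e, lw_stall):
    flush_d = pc_src_e
    flush_e = lw_stall or pc_src_e
    return flush_d, flush_e

print(flush_signals(True, False))   # taken branch -> (True, True)
print(flush_signals(False, True))   # load-use stall -> (False, True)
```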
> **Tip:** Effective hazard handling is crucial for achieving near-ideal performance from a pipelined processor. While forwarding reduces stalls significantly for data hazards, load-use hazards and control hazards often necessitate stalls or flushing.
### 4.3 Pipelined processor performance
The performance of a pipelined processor is measured by its average CPI (Cycles Per Instruction) and execution time.
#### 4.3.1 Performance Metrics
* **Average CPI:** Ideally, a pipelined processor achieves a CPI of 1. However, hazards cause stalls, increasing the average CPI.
* **Example Calculation:** For a SPECINT2000 benchmark with 25% loads, 10% stores, 13% branches, and 52% R-type instructions:
* Assume 40% of loads cause a one-cycle stall, so load CPI = `1*(0.6) + 2*(0.4) = 1.4`.
* Assume 50% of branches mispredict with a two-cycle penalty, so branch CPI = `1*(0.5) + 3*(0.5) = 2`.
* Average CPI = `(0.25 × load CPI) + (0.1 × store CPI) + (0.13 × branch CPI) + (0.52 × R-type CPI)`
* Average CPI = `(0.25 × 1.4) + (0.1 × 1) + (0.13 × 2) + (0.52 × 1) = 0.35 + 0.1 + 0.26 + 0.52 = 1.23`
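The weighted average above is easy to reproduce programmatically; a small sketch using the instruction mix and per-type CPIs from the text:

```python
# Weighted-average CPI for the SPECINT2000 mix described above.
mix = {"load": 0.25, "store": 0.10, "branch": 0.13, "rtype": 0.52}
cpi = {"load": 1.4, "store": 1.0, "branch": 2.0, "rtype": 1.0}

avg_cpi = sum(mix[k] * cpi[k] for k in mix)
print(round(avg_cpi, 2))  # 1.23
```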
* **Clock Cycle Time (`Tc`):** The clock cycle time of a pipelined processor is determined by the longest delay of any single stage, plus pipeline register overhead (clock-to-Q and setup time).
* $T_{c\_pipelined} = \max(\text{Stage Delays}) + t_{pcq} + t_{setup}$
* The critical path is often in the Execute stage, which includes ALU operations and multiplexers.
* Example calculation for the Execute-stage critical path: $T_{c\_pipelined} = t_{pcq} + 4t_{mux} + t_{ALU} + t_{AND\text{-}OR} + t_{setup} = (40 + 4 \times 30 + 120 + 20 + 50) \text{ ps} = 350 \text{ ps}$
* **Execution Time:**
* Execution Time = (# instructions) × CPI × $T_c$
* For 100 billion instructions, an average CPI of 1.23, and a $T_c$ of 350 ps:
* Execution Time = $(100 \times 10^9) \times 1.23 \times (350 \times 10^{-12} \text{ s}) \approx 43 \text{ seconds}$
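Putting the two calculations together (a sketch using the delay values and instruction count from the text):

```python
# Tc from the Execute-stage critical path, then
# execution time = instructions × CPI × Tc.
t_pcq, t_mux, t_alu, t_and_or, t_setup = 40, 30, 120, 20, 50  # delays in ps

Tc = t_pcq + 4 * t_mux + t_alu + t_and_or + t_setup
print(Tc)  # 350 (ps)

instructions = 100e9
avg_cpi = 1.23
exec_time = instructions * avg_cpi * Tc * 1e-12  # ps -> seconds
print(round(exec_time, 2))  # 43.05 (seconds)
```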
#### 4.3.2 Performance Comparison
Pipelining significantly improves performance compared to single-cycle or multicycle processors due to increased throughput.
* **Single-cycle:** Baseline; high latency per instruction.
* **Multicycle:** Better than single-cycle but still less efficient than pipelined.
* **Pipelined:** Offers substantial speedup by overlapping execution, despite increased complexity and potential stalls due to hazards.
---
# Advanced microarchitecture techniques
Advanced microarchitecture techniques enhance processor performance beyond basic pipelining by executing instructions more efficiently and concurrently.
### 5.1 Deep pipelining
Deep pipelining increases the number of stages in a processor's pipeline, typically to 10-20 stages. The number of stages is limited by factors such as pipeline hazards, sequencing overhead, power consumption, and cost.
### 5.2 Micro-operations
Complex instructions are decomposed into a series of simpler instructions known as micro-operations (micro-ops or µ-ops). At runtime, complex instructions are decoded into one or more micro-ops. This technique is heavily used in CISC (complex instruction set computer) architectures, such as x86. For example, a `lw` instruction with a post-increment operation can be broken down into a `lw` micro-op and an `addi` micro-op; without micro-ops, such an instruction would require a second write port on the register file, because it writes both the loaded value and the updated base register.
> **Tip:** Micro-operations allow complex instructions to be handled by simpler hardware units, improving design flexibility and potentially performance.
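The decomposition above can be sketched as a tiny decode helper. The post-increment `lw` form and the exact micro-op operands are illustrative assumptions (base RISC-V has no post-increment addressing mode), not the book's decoder:

```python
# Hypothetical decode of a post-increment load into two simple micro-ops,
# each with a single register write. Mnemonics/operands are illustrative.
def decode_post_increment_lw(dst: str, base: str, offset: int) -> list[str]:
    return [
        f"lw   {dst}, {offset}({base})",   # micro-op 1: the load itself
        f"addi {base}, {base}, {offset}",  # micro-op 2: update the base register
    ]

for uop in decode_post_increment_lw("x1", "x2", 4):
    print(uop)
```

Each micro-op now writes only one register, so a single-write-port register file suffices.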
### 5.3 Branch prediction
Branch prediction is a crucial technique to mitigate the performance penalty caused by branches in pipelined processors. It involves guessing whether a branch will be taken so that the pipeline rarely needs to be flushed.
* **Static branch prediction:** This method determines the predicted direction from whether the branch is forward or backward. Backward branches (often found at the bottom of loops) are typically predicted as taken, while forward branches are predicted as not taken.
* **Dynamic branch prediction:** This approach uses the history of recent branch executions to make more accurate predictions. A branch target buffer (BTB) stores the destination and taken status of the last several hundred or thousand branches.
* **1-bit branch predictor:** Remembers the outcome of the last execution of a branch and predicts the same outcome for the next execution. This predictor mispredicts both the first and last branches of a loop.
* **2-bit branch predictor:** Uses a state machine with four states (Strongly Taken, Weakly Taken, Weakly Not Taken, Strongly Not Taken) to track branch behavior. This improves accuracy, mispredicting only the last branch of a loop in the example below.
> **Example:** In a loop that executes 10 times, a 1-bit predictor will mispredict the branch condition the first time it's encountered (predicting not taken when it should be taken) and the last time (predicting taken when it should be not taken). A 2-bit predictor reduces this to only one misprediction.
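This behavior can be checked with a small saturating-counter model. The initial states are chosen to match the example (1-bit starts at "not taken", 2-bit starts at "weakly taken"); this is a sketch, not the hardware implementation:

```python
# Saturating-counter branch predictor: predict taken when the counter is
# in the upper half of its range; nudge the counter toward each outcome.
def mispredictions(bits: int, start: int, outcomes: list[bool]) -> int:
    top = (1 << bits) - 1
    state, misses = start, 0
    for taken in outcomes:
        predicted_taken = state > top // 2
        if predicted_taken != taken:
            misses += 1
        # saturating update toward the actual outcome
        state = min(top, state + 1) if taken else max(0, state - 1)
    return misses

loop = [True] * 9 + [False]        # 10-iteration loop branch: taken 9×, then not
print(mispredictions(1, 0, loop))  # 1-bit, start "not taken": 2 misses
print(mispredictions(2, 2, loop))  # 2-bit, start "weakly taken": 1 miss
```

The 2-bit counter needs two wrong outcomes in a row to flip its prediction, which is exactly why the single loop-exit branch costs it only one miss.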
### 5.4 Superscalar and out-of-order processors
These architectures aim to increase Instruction-Level Parallelism (ILP) by executing multiple instructions concurrently.
#### 5.4.1 Superscalar processors
Superscalar processors feature multiple copies of datapath units that can execute several instructions simultaneously. The primary challenge is managing dependencies between instructions, which can prevent multiple instructions from being issued in the same clock cycle.
> **Tip:** The ideal Instruction Per Cycle (IPC) for a superscalar processor is equal to the number of instructions it can issue per cycle. Actual IPC is often lower due to dependencies.
#### 5.4.2 Out-of-order (OOO) processors
OOO processors overcome the limitations of in-order execution by looking ahead at multiple instructions and issuing each as soon as its operands are available and a functional unit is free, even if that is out of program order. They must still respect data dependencies:
* **RAW (Read After Write):** A later instruction reads a register that an earlier instruction writes.
* **WAR (Write After Read):** A later instruction writes a register that an earlier instruction reads.
* **WAW (Write After Write):** A later instruction writes a register that an earlier instruction also writes.
OOO processors use a scoreboard to track instructions waiting to issue, available functional units, and detected dependencies. This lets them achieve a higher actual IPC than in-order superscalar processors when dependencies would otherwise cause stalls.
> **Example:** If instruction A writes to register `r1` and instruction B reads from `r1`, instruction B cannot execute until instruction A has completed its write. An OOO processor can schedule other independent instructions to execute while waiting for `r1` to be ready.
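The three dependence types can be detected mechanically by comparing each instruction's read and write sets; a small illustrative helper (the representation as `(writes, reads)` sets is an assumption for the sketch):

```python
# Classify the dependence(s) between an earlier and a later instruction,
# each given as a (writes, reads) pair of register-name sets.
def classify(earlier, later):
    e_writes, e_reads = earlier
    l_writes, l_reads = later
    deps = []
    if e_writes & l_reads:
        deps.append("RAW")  # true dependence: later reads what earlier wrote
    if e_reads & l_writes:
        deps.append("WAR")  # anti-dependence: later overwrites a source
    if e_writes & l_writes:
        deps.append("WAW")  # output dependence: both write the same register
    return deps

# add x1, x2, x3   then   sub x4, x1, x5  -> sub reads x1 written by add
print(classify(({"x1"}, {"x2", "x3"}), ({"x4"}, {"x1", "x5"})))  # ['RAW']
```

Only RAW is a true data dependence; WAR and WAW are naming conflicts, which register renaming (next section) removes.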
#### 5.4.3 Register renaming
Register renaming is a technique used in OOO processors to eliminate WAR and WAW hazards. It achieves this by using a larger set of physical registers than architectural registers. When an instruction needs to write to a register, it is assigned a new, free physical register. This effectively separates the read and write operations, preventing false dependencies.
> **Tip:** Register renaming is critical for enabling effective out-of-order execution by resolving naming conflicts among instructions.
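A minimal renaming sketch, assuming a register alias table (RAT) that maps each architectural register to its most recent physical register (instruction format and `p0, p1, …` names are illustrative):

```python
# Minimal register-renaming sketch: each architectural destination gets a
# fresh physical register; sources are read through the current mapping.
def rename(program):
    rat = {}                                    # architectural -> physical
    free = iter(f"p{i}" for i in range(100))    # free list of physical regs
    renamed = []
    for op, dst, srcs in program:
        srcs = [rat.get(s, s) for s in srcs]    # read current mappings (keeps RAW)
        rat[dst] = next(free)                   # fresh reg removes WAR/WAW
        renamed.append((op, rat[dst], srcs))
    return renamed

# Two writes to x1 (a WAW hazard) land in distinct physical registers,
# while the later read of x1 correctly sees the second write.
prog = [("add", "x1", ["x2", "x3"]),
        ("sub", "x1", ["x4", "x5"]),
        ("or",  "x6", ["x1", "x7"])]
for ins in rename(prog):
    print(ins)
```

After renaming, the two `x1` writes target `p0` and `p1`, so they can execute in either order; the `or` reads `p1`, preserving the true (RAW) dependence on the `sub`.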
### 5.5 SIMD (Single Instruction Multiple Data)
SIMD is an architecture in which a single instruction operates on multiple pieces of data simultaneously. It is common in graphics processing and for arithmetic on many short (narrow) operands packed into a single wide word, known as packed arithmetic.
> **Example:** A single SIMD instruction could add eight 8-bit integer elements from two separate data sets, performing eight additions concurrently.
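The lane-wise arithmetic can be sketched by packing eight 8-bit values into a 64-bit integer; real SIMD hardware performs all eight lane additions in one instruction, whereas this model loops over the lanes only to show what that instruction computes:

```python
# Packed 8-bit addition on a 64-bit word: each byte lane is added
# independently and wraps modulo 256, with no carry between lanes.
def packed_add8(a: int, b: int) -> int:
    result = 0
    for lane in range(8):
        shift = 8 * lane
        la = (a >> shift) & 0xFF
        lb = (b >> shift) & 0xFF
        result |= ((la + lb) & 0xFF) << shift  # each lane wraps independently
    return result

a = 0x0102030405060708
b = 0x1010101010101010
print(hex(packed_add8(a, b)))  # 0x1112131415161718
```

Note the key property: an overflow in one lane (e.g., `0xFF + 0x01`) wraps within that lane instead of carrying into its neighbor, which is what distinguishes packed arithmetic from one wide addition.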
### 5.6 Multithreading and Multiprocessors
These techniques exploit parallelism at a higher level than individual instruction execution.
#### 5.6.1 Multithreading
Multithreading allows a processor to handle multiple threads of execution concurrently, which requires keeping multiple copies of the architectural state. When one thread stalls (e.g., waiting for memory), another thread can immediately start executing, utilizing idle execution units and increasing overall throughput. Intel's implementation is known as "hyperthreading".
* **Process:** A program running on a computer.
* **Thread:** A part of a program. A process can have multiple threads; for instance, a word processor might have threads for typing, spell checking, and printing.
* **Context Switching:** In a single-core system, when one thread stalls, its architectural state is saved and another thread's state is loaded for execution.
> **Tip:** Multithreading improves processor utilization and throughput, especially in workloads with frequent I/O or memory access stalls, but it does not increase the ILP of a single thread.
#### 5.6.2 Multiprocessors
Multiprocessors involve multiple processors (cores) on a single chip that communicate with each other. Types include:
* **Homogeneous:** Multiple identical cores sharing main memory.
* **Heterogeneous:** Cores designed for different tasks, like a DSP and a CPU in a mobile device.
* **Clusters:** Each core has its own memory system.
---
# Glossary
| Term | Definition |
|------|------------|
| Microarchitecture | The specific hardware implementation of a computer architecture, detailing how an architecture is realized in silicon. It includes the design of the processor's datapath and control logic. |
| Architecture | The abstract model of a computer, defining the instruction set, registers, memory addressing, and I/O that a programmer interacts with. It does not specify the hardware implementation details. |
| Datapath | The part of a processor that performs data processing operations, including functional units like ALUs, registers, and multiplexers, and the pathways between them. |
| Control Unit | The component of a processor that generates control signals to direct the operation of the datapath and other processor components based on the decoded instructions. |
| Single-cycle processor | A processor design where each instruction completes execution in a single clock cycle, with the cycle time determined by the longest instruction path. |
| Multicycle processor | A processor design that breaks down instruction execution into multiple clock cycles, allowing for shorter clock periods and hardware reuse. |
| Pipelined processor | A processor design that overlaps the execution of multiple instructions by dividing instruction processing into sequential stages, with each stage operating on a different instruction concurrently. |
| Clock Period (Tc) | The duration of a single clock cycle, measured in seconds. It is the inverse of the clock frequency ($Tc = 1/f_{clock}$). |
| CPI (Cycles Per Instruction) | A measure of processor performance, representing the average number of clock cycles required to execute one instruction. |
| IPC (Instructions Per Cycle) | A measure of processor performance, representing the average number of instructions that can be executed in one clock cycle. It is the inverse of CPI ($IPC = 1/CPI$). |
| RISC-V | An open-standard instruction set architecture (ISA) based on Reduced Instruction Set Computer principles, designed to be simple, modular, and extensible. |
| Architectural State Elements | The set of registers, the program counter (PC), and memory that define the current state of a processor visible to the programmer. |
| Register File (RF) | A small, fast memory unit within a processor that stores frequently used data values in general-purpose registers. |
| Program Counter (PC) | A special register that holds the memory address of the next instruction to be fetched and executed. |
| Instruction Memory | A memory unit that stores program instructions. In some designs, it is separate from data memory. |
| Data Memory | A memory unit that stores data values used and produced by the program. |
| Immediate | A constant value that is part of an instruction itself, used directly in operations without needing to be fetched from memory or the register file. |
| ALU (Arithmetic Logic Unit) | The digital circuit within a processor that performs arithmetic and logical operations on operands. |
| Control Signals | Signals generated by the control unit that dictate the operation of various components within the datapath, such as selecting operations, enabling writes, or routing data. |
| Hazard | A situation in a pipelined processor where the next instruction cannot execute in the next clock cycle due to data or control dependencies on previous instructions. |
| Data Hazard | A pipeline hazard that occurs when an instruction needs data that has not yet been produced by a preceding instruction that is still in the pipeline. |
| Control Hazard | A pipeline hazard that occurs when the processor fetches instructions that are not supposed to be executed, typically due to a branch instruction whose outcome is not yet known. |
| Forwarding (Bypassing) | A technique used in pipelined processors to resolve data hazards by sending the result of an instruction directly from its execution stage to a subsequent instruction that needs it, before it is written back to the register file. |
| Stall (Pipeline Bubble) | A mechanism in a pipelined processor to temporarily halt the pipeline and insert a "no-operation" (NOP) or "bubble" to resolve hazards, allowing dependencies to be met. |
| Flush | The process of discarding instructions that have been incorrectly fetched due to a mispredicted branch or a resolved hazard in a pipelined processor. |
| Branch Prediction | A technique used in pipelined processors to guess the outcome of a conditional branch instruction before it is fully resolved, aiming to keep the pipeline full. |
| Superscalar Processor | A processor that can execute more than one instruction per clock cycle by having multiple execution units and the ability to issue multiple instructions simultaneously. |
| Out-of-Order (OOO) Processor | A processor that can execute instructions in an order different from their original program sequence, as long as data dependencies are respected, to improve instruction-level parallelism. |
| Register Renaming | A technique used in out-of-order processors to eliminate false data dependencies (Write-After-Write and Write-After-Read) by assigning temporary physical registers to architectural registers. |
| SIMD (Single Instruction Multiple Data) | A parallel processing architecture where a single instruction operates on multiple data elements simultaneously. |
| Multithreading | A technique that allows a single processor core to execute multiple threads of a program concurrently by duplicating architectural state, improving throughput. |
| Multiprocessor | A system with multiple processing units (cores) that can operate in parallel, either sharing memory or having their own dedicated memory systems. |
| Micro-operation (µ-op) | A very basic operation that a processor can perform, often used to decompose complex instructions into simpler steps for execution, particularly in CISC architectures. |
| Branch Target Buffer (BTB) | A cache used in branch predictors to store the target addresses of recently executed branches, speeding up branch resolution. |
| SPECINT2000 | A benchmark suite used to measure the performance of integer computation in computer systems. |