# Computer Engineering
**Bron:** 1__ServerHardware-Interfaces.pdf
# Basis computerhardware en datacenterarchitectuur
Dit studieonderwerp duikt in de fundamentele bouwstenen van computers en de geavanceerde infrastructuur van datacenters.
### 1.1 Basiscomponenten van een computer
Een computer bestaat uit diverse essentiële componenten die samenwerken om functionaliteit te bieden. Deze omvatten de centrale verwerkingseenheid (CPU), Random Access Memory (RAM), opslagapparaten, het moederbord, de voedingseenheid (PSU) en koelsystemen, verbonden via interconnecties [4](#page=4).
#### 1.1.1 Centrale Verwerkingseenheid (CPU)
De CPU is een reeks elektrische circuits die fundamentele berekeningen uitvoert zoals laden, opslaan, vergelijken, optellen, aftrekken en verschuiven. De architectuur van een CPU wordt bepaald door de fabrikant (bijv. Intel, AMD, ARM) en het aantal bits dat tegelijkertijd wordt verwerkt (x32 of x64). De kloksnelheid, gemeten in Hertz (Hz), indiceert het aantal instructies dat per seconde kan worden uitgevoerd; bijvoorbeeld, 4 GHz staat voor 4 miljard instructies per seconde. CPU's kunnen worden geclassificeerd als CISC (Complex Instruction Set Computer), RISC (Reduced Instruction Set Computer) of EPIC (Explicitly Parallel Instruction Computing) [5](#page=5).
Een processor kan verder worden onderverdeeld in:
* **Processor socket:** De fysieke houder op het moederbord waarin een processor wordt geplaatst; een server kan meerdere sockets bevatten [5](#page=5).
* **Processor core:** Een onafhankelijke verwerkingseenheid binnen een processor [5](#page=5).
* **Processor thread:** Een 'logische processor' die bronnen deelt met andere threads op dezelfde core [5](#page=5).
> **Tip:** Een systeem met 4 processor sockets, elk met 8 cores en 16 threads per core, heeft effectief 512 logische CPU's (LCPU's). Informatie over de CPU kan worden opgevraagd via `msinfo32` in Windows of `lscpu` in Linux [5](#page=5).
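Ter illustratie een minimale Python-schets die de telling uit de tip hierboven naspeelt (de aantallen sockets, cores en threads zijn de voorbeeldwaarden uit de tip):

```python
def logische_cpus(sockets: int, cores_per_socket: int, threads_per_core: int) -> int:
    """Aantal logische CPU's (LCPU's) = sockets x cores per socket x threads per core."""
    return sockets * cores_per_socket * threads_per_core

# Voorbeeld uit de tip hierboven: 4 sockets, 8 cores per socket, 16 threads per core.
print(logische_cpus(4, 8, 16))  # 512
```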
#### 1.1.2 RAM en Opslag
RAM (Random Access Memory) dient als snel, maar vluchtig geheugen dat dicht bij de CPU staat, waardoor het minder capaciteit heeft vanwege de hogere kosten. Opslag daarentegen is persistent, trager en biedt een zeer hoge capaciteit tegen lagere kosten. Cache is een beperkte opslagruimte binnen de CPU of op opslagapparaten voor verdere optimalisatie. Trends duiden op het verdwijnen van dit onderscheid, met toekomstige technologieën zoals kwantumcomputers die de opslagcapaciteit per bit aanzienlijk zullen vergroten [6](#page=6).
#### 1.1.3 Interconnecties en Communicatie
Voor servers zijn diverse interconnectietopologieën belangrijk [7](#page=7):
* **Bus topology (SCSI, PATA):** Meerdere eindpunten communiceren via een gemeenschappelijk medium [7](#page=7).
* **Point-to-Point:** Twee eindpunten communiceren via een privé medium, vaak richting randapparatuur (SAS, SATA) [7](#page=7).
* **Ring (FC - Fibre Channel):** Elk eindpunt maakt deel uit van een lus en stuurt pakketten die niet voor hem bestemd zijn door [7](#page=7).
* **Star/Tree/Mesh (iSCSI):** Meestal gerelateerd aan netwerkprotocollen [7](#page=7).
De snelheid van deze verbindingen wordt bepaald door de doorvoer (throughput) en latentie (latency) [7](#page=7).
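Een eenvoudige rekenschets die laat zien hoe doorvoer en latentie samen de overdrachtstijd bepalen (de gebruikte waarden zijn fictieve aannames, geen cursusgegevens):

```python
def overdrachtstijd_s(grootte_bytes: float, doorvoer_bytes_per_s: float, latentie_s: float) -> float:
    """Benadering: totale overdrachtstijd = latentie + grootte / doorvoer."""
    return latentie_s + grootte_bytes / doorvoer_bytes_per_s

# Fictief voorbeeld: 1 GiB over een verbinding van 500 MB/s met 2 ms latentie.
print(round(overdrachtstijd_s(1 * 1024**3, 500e6, 0.002), 3), "seconden")  # 2.149 seconden
```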
#### 1.1.4 Verschillen tussen apparaten (PC, Workstation, Server)
Belangrijke factoren bij de keuze van apparatuur zijn het gebruiksdoel en de kosten [8](#page=8).
* **PC:** Ontworpen voor gebruiksgemak door niet-IT-professionals, draagbaarheid en robuustheid [8](#page=8).
* **Desktop:** Een PC die standaard interconnecties gebruikt voor diverse randapparatuur [8](#page=8).
* **Workstation:** Een PC die hoge CPU-kracht vereist [8](#page=8).
* **Servers:** Stellen hoge eisen aan betrouwbaarheid, beschikbaarheid en onderhoudbaarheid (RAS - Reliability-Availability-Serviceability), standaardisatie, remote management en diagnostische tools, en professionele uitbreidingsmogelijkheden [8](#page=8).
### 1.2 Datacenterarchitectuur
Een datacenter is een gespecialiseerde faciliteit voor het huisvesten en beheren van IT-apparatuur. Dit omvat niet alleen servers, maar ook netwerkapparatuur zoals switches en routers, patch panelen en KVM-switches, die vaak zijn gemonteerd in 19-inch racks [10](#page=10) [15](#page=15) [30](#page=30).
#### 1.2.1 Server Form Factors
Servers komen in diverse vormfactoren om te voldoen aan verschillende eisen qua dichtheid, opslag, koeling en voeding [31](#page=31).
* **1U Server:** Compacte servers, ideaal voor hoge dichtheid, maar met beperkte opslag en aandacht voor koeling en geluid. Ze gebruiken vaak horizontale slots en hebben doorgaans één voeding [31](#page=31).
* **2U, 3U, 4U Servers:** Grotere form factors die meer opslagruimte, meer uitbreidingsmogelijkheden en krachtigere koelingsoplossingen bieden [30](#page=30) [35](#page=35).
* **Blade Servers:** Deze uiterst compacte servers (bv. 14 blades in 6U) bieden hoge dichtheid en schaalbaarheid. Ze maken gebruik van gedeelde power modules, fans en switches in een chassis, wat resulteert in minder kabels en centraal beheer. Blade servers bieden flexibele installatie van netwerk- en opslagmodules en ingebouwde redundantie. Nadelen zijn de vaak beperkte verwerkingskracht per blade, vendor lock-in, de hoge kosten indien niet volledig benut, en intensieve koelingsbehoeften [37](#page=37) [39](#page=39).
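Onderstaande schets vergelijkt, puur ter illustratie en met het voorbeeld van 14 blades in 6U uit de tekst, hoeveel servers in een standaard 42U-rack passen (koeling, voeding en switches blijven hier buiten beschouwing):

```python
RACK_U = 42  # hoogte van een standaard 19-inch rack (aanname: volledig beschikbaar)

servers_1u = RACK_U // 1      # 42 losse 1U-servers
chassis_6u = RACK_U // 6      # 7 blade-chassis van 6U
blades = chassis_6u * 14      # 14 blades per chassis (voorbeeld uit de tekst)

print(f"1U-servers per rack: {servers_1u}")  # 42
print(f"Blades per rack:     {blades}")      # 98
```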
#### 1.2.2 Opslag in Datacenters
Opslagarchitecturen variëren van Direct Attached Storage (DAS) tot Network Attached Storage (NAS) en Storage Area Networks (SAN) [47](#page=47).
* **Direct Attached Storage (DAS):** Eenvoudige opslag direct gekoppeld aan een server. Kan bestaan uit losse schijven of DAS-enclosures, inclusief JBODs (Just-a-Bunch-Of-Disks). Biedt hoge prestaties maar beperkte schaalbaarheid en back-upmogelijkheden. Daisy-chaining van DAS-enclosures is mogelijk [50](#page=50) [51](#page=51) [52](#page=52).
* **Network Attached Storage (NAS):** Bestandsgebaseerde opslag die via het netwerk toegankelijk is, gebruikmakend van protocollen zoals CIFS (Windows) en NFS (Unix). NAS-systemen zijn ideaal voor heterogene besturingssystemen, eenvoudig op te zetten en uit te breiden. Een nadeel is de schaalbaarheid van prestaties, die vaak beperkt wordt door de LAN-prestaties. Clouddiensten als OneDrive, Dropbox en Google Drive bieden eveneens bestandsgebaseerde opslag via het netwerk, al vallen ze strikt genomen buiten de klassieke NAS-indeling [49](#page=49) [53](#page=53) [54](#page=54).
* **Storage Area Network (SAN):** Een speciaal netwerk dat blokgebaseerde opslag biedt, gekoppeld aan servers. SAN's bieden hogere redundantie en schalen zeer goed op het gebied van prestaties en capaciteit. Twee belangrijke SAN-technologieën zijn Fibre Channel (FC) en iSCSI (Internet Small Computer System Interface). iSCSI maakt gebruik van reguliere Ethernet-hardware, terwijl FC aangepaste hardware en software vereist [55](#page=55) [57](#page=57) [58](#page=58) [60](#page=60).
* **iSCSI Terminologie:**
    * **iSCSI Target:** De iSCSI 'server' die Logical Unit Numbers (LUNs) aanbiedt [59](#page=59).
    * **iSCSI Initiator:** De iSCSI 'client' die verbinding maakt met LUNs [59](#page=59).
Voordelen van SAN zijn gecentraliseerde back-up, disaster recovery en schaalbaarheid, maar ze zijn duurder en vereisen dat het bestandssysteem LUN-deling ondersteunt [60](#page=60).
#### 1.2.3 Disk Interfaces
Verschillende interfaces worden gebruikt om schijven aan te sluiten [61](#page=61):
* **Parallel ATA (PATA):** Vroeger bekend als IDE, gebruikt voor CD-spelers en harde schijven. Beperkt tot twee apparaten per connector en een maximale kabellengte van 45 cm [63](#page=63).
* **SCSI (Small Computer System Interface):** Oudere technologie die zowel interne als externe interfaces ondersteunde. Elk SCSI-controller kan tot 15 apparaten adresseren met een unieke SCSI ID. Wordt nog steeds gebruikt voor tapestreamers [64](#page=64) [65](#page=65).
* **Serial ATA (SATA):** Hedendaagse standaard voor desktops en laptops, met verschillende versies (SATA I, II, III) die snelheden bieden van 1.5 Gbps tot 6 Gbps. SATA is een point-to-point systeem, wat complexiteit vermindert. eSATAp combineert eSATA met USB voor externe connectiviteit [66](#page=66) [67](#page=67).
* **Serial Attached SCSI (SAS):** Implementeert het SCSI-protocol over een seriële interface die fysiek op SATA lijkt, met voordelen zoals point-to-point communicatie, geen prestatieverlies en ondersteuning voor meer apparaten per controller. SAS is onder bepaalde omstandigheden compatibel met SATA. Diverse interne en externe SAS-connectoren bestaan, zoals SFF-8482, SFF-8087 en SFF-8088 [70](#page=70) [71](#page=71).
* **U.2 en M.2:** Moderne interfaces die PCI Express-protocollen kunnen gebruiken (NVMe) naast SATA, waardoor hogere prestaties mogelijk zijn [69](#page=69).
#### 1.2.4 Card Slot Interfaces
Uitbreidingskaarten worden via verschillende slot interfaces op het moederbord aangesloten [74](#page=74):
* **PCI (Peripheral Component Interconnect):** Een bussysteem waarbij uitbreidingskaarten via riser cards kunnen worden aangesloten. De PCI-bus ondersteunt snelheden tot 132 MBps bij 33 MHz (32-bit) of 264 MBps bij 66 MHz (32-bit) [75](#page=75).
* **PCI-X:** Een verbeterde versie van PCI, voornamelijk gebruikt in servers, met 64-bit adressering en hogere frequenties, tot 528 MBps (66 MHz, 64-bit) of zelfs 4264 MBps (533 MHz, 64-bit) [76](#page=76).
* **AGP (Accelerated Graphics Port):** Ontworpen om de beperkingen van PCI voor grafische kaarten te overwinnen, met snelheden tot 2133 MBps (AGP 8X). AGP werd voornamelijk gebruikt in desktops en workstations [77](#page=77).
* **PCI Express (PCI-E):** De moderne standaard die PCI, PCI-X en AGP vervangt. PCI-E is een point-to-point verbinding, wat zorgt voor meer doorvoer, lagere pin-aantallen en betere schaalbaarheid. De link kan variëren van x1 tot x32 lanes, met snelheden die per generatie toenemen (bijv. 1 GBps per lane voor PCIe 3.0, 2 GBps per lane voor PCIe 4.0). Het maximale stroomverbruik per PCIe-interface is 75 Watt [78](#page=78) [79](#page=79).
> **Tip:** Hoewel een slot mechanisch x16 kan zijn, betekent dit niet altijd dat het elektrisch ook 16 lanes ondersteunt. Riser cards worden vaak gebruikt in servers om meer plug-in kaarten aan te sluiten [80](#page=80).
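Een rekenschets bij de hierboven genoemde snelheden: de doorvoer van de klassieke PCI-bus volgt uit busbreedte maal klokfrequentie, en die van PCI Express uit het aantal lanes maal de snelheid per lane (de waarden per lane zijn de afgeronde getallen uit de tekst):

```python
def pci_doorvoer_mb_s(bus_bits: int, klok_mhz: float) -> float:
    """Parallelle PCI-bus: busbreedte in bytes x klokfrequentie in MHz = doorvoer in MB/s."""
    return (bus_bits / 8) * klok_mhz

def pcie_doorvoer_gb_s(lanes: int, gb_s_per_lane: float) -> float:
    """PCI Express is point-to-point: de doorvoer schaalt met het aantal lanes."""
    return lanes * gb_s_per_lane

print(pci_doorvoer_mb_s(32, 33))    # 132.0 MB/s (PCI, 32-bit @ 33 MHz)
print(pcie_doorvoer_gb_s(16, 1.0))  # 16.0 GB/s  (PCIe 3.0 x16, ~1 GB/s per lane)
print(pcie_doorvoer_gb_s(16, 2.0))  # 32.0 GB/s  (PCIe 4.0 x16, ~2 GB/s per lane)
```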
---
# Datacenterbeheer en kosten
Dit onderdeel behandelt de financiële aspecten van datacenters, waaronder de totale eigendomskosten (TCO), rendementspercentage (ROI) en terugverdientijd, evenals de beheer-, onderhouds- en beveiligingsaspecten.
### 2.1 Kostenaspecten van datacenters
De kosten van datacenters zijn veelzijdig en omvatten zowel initiële investeringen als doorlopende operationele uitgaven.
#### 2.1.1 Total Cost of Ownership (TCO)
De TCO vertegenwoordigt de totale kosten die gepaard gaan met het bezitten van een IT-infrastructuur gedurende de gehele levenscyclus. Dit omvat een breed scala aan componenten:
* Hardware [13](#page=13).
* Software [13](#page=13).
* Datacenterkosten (bv. ruimte, stroom, koeling) [13](#page=13).
* Managementkosten [13](#page=13).
De kosten voor energie zijn een significant en stijgend onderdeel van de datacenteruitgaven. De energie per server blijft toenemen. Momenteel wordt voor elke dollar die aan hardware wordt uitgegeven, vijftig cent aan energie besteed. Deze energiekosten zullen naar verwachting in de komende vier jaar met 54% stijgen [11](#page=11).
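Ter illustratie van deze cijfers een minimale rekenschets die de energiekosten per hardware-dollar en de verwachte stijging van 54% over vier jaar doorrekent (het hardwarebudget is een fictieve aanname):

```python
hardware_budget = 100_000     # fictief hardwarebudget in dollar
energie_per_hw_dollar = 0.50  # 50 cent energie per dollar hardware (zie tekst)
stijging_4_jaar = 0.54        # verwachte stijging van 54% in vier jaar (zie tekst)

energie_nu = hardware_budget * energie_per_hw_dollar
energie_over_4_jaar = energie_nu * (1 + stijging_4_jaar)

print(f"Energiekosten nu:          ${energie_nu:,.0f}")          # $50,000
print(f"Energiekosten over 4 jaar: ${energie_over_4_jaar:,.0f}")  # $77,000
```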
> **Tip:** Het goed begrijpen van de TCO is cruciaal voor een accurate budgettering en strategische besluitvorming met betrekking tot IT-investeringen.
#### 2.1.2 Return on Investment (ROI) en terugverdientijd
Naast de kosten is het ook belangrijk om de potentiële winst en de tijd die nodig is om de investering terug te verdienen te evalueren.
* **Terugverdientijd (Payback Period):** Dit is de periode waarin de initiële investering is terugverdiend. Factoren die de terugverdientijd beïnvloeden zijn onder meer de TCO, de tijd tot de markt (go-to-market time) en potentiële verkoopopbrengsten [13](#page=13).
* **Return on Investment (ROI):** ROI meet de winstgevendheid van een investering. Het antwoordt op de vraag welk rendement een investering zal opleveren [13](#page=13).
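Onderstaande schets toont de gangbare basisformules voor terugverdientijd en ROI; de gebruikte bedragen zijn fictieve aannames en komen niet uit de cursus:

```python
def terugverdientijd_jaren(investering: float, jaarlijkse_netto_opbrengst: float) -> float:
    """Terugverdientijd = initiële investering / netto opbrengst per jaar."""
    return investering / jaarlijkse_netto_opbrengst

def roi_procent(totale_opbrengst: float, totale_kosten: float) -> float:
    """ROI = (opbrengst - kosten) / kosten x 100%."""
    return (totale_opbrengst - totale_kosten) / totale_kosten * 100

# Fictief voorbeeld: investering van 200.000, netto 50.000 opbrengst per jaar,
# totale opbrengst van 320.000 over de hele levensduur.
print(terugverdientijd_jaren(200_000, 50_000))  # 4.0 jaar
print(roi_procent(320_000, 200_000))            # 60.0 %
```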
> **Tip:** Houd bij het berekenen van de terugverdientijd en ROI rekening met zowel directe als indirecte kosten en opbrengsten om een realistisch beeld te krijgen.
### 2.2 Beheer en onderhoud van datacenters
Het effectief beheren en onderhouden van een datacenter is essentieel voor de operationele stabiliteit en efficiëntie.
#### 2.2.1 Componenten van beheeruitgaven
De uitgaven voor datacenterbeheer kunnen worden onderverdeeld in verschillende categorieën [12](#page=12):
* Initiële systeem- en software-implementatie: 19% [12](#page=12).
* Planning voor upgrades, uitbreiding en capaciteit: 15% [12](#page=12).
* Upgrades, patches, etc.: 15% [12](#page=12).
* Systeemmonitoring: 13% [12](#page=12).
* Systeemonderhoud: 12% [12](#page=12).
* Migratie: 11% [12](#page=12).
* Onderhoud en tuning: 8% [12](#page=12).
* Overig: 7% [12](#page=12).
#### 2.2.2 Hoge beschikbaarheid versus rampenherstel (Disaster Recovery)
Er is een belangrijk onderscheid tussen het continu draaiende houden van systemen (hoge beschikbaarheid) en het herstellen van diensten na een calamiteit (rampenherstel).
* **Recovery Time Objective (RTO):** Dit bepaalt hoeveel tijd nodig is om de bedrijfsvoering te hervatten na een incident [14](#page=14).
* **Recovery Point Objective (RPO):** Dit specificeert hoeveel data verloren mag gaan tijdens een incident [14](#page=14).
Het paradoxale aspect hierbij is dat hoge beschikbaarheid (HA) gericht is op het continu blijven draaien zolang het maar enigszins kan, zelfs als dit dataverlies veroorzaakt. Rampenherstel (DR) daarentegen focust op het stoppen van operaties wanneer dataverlies dreigt, met als zekerheid dat de overeengekomen gegevens hersteld kunnen worden [14](#page=14).
**Voorbeelden van Hoge Beschikbaarheid (HA):**
* Redundante voeding [14](#page=14).
* Dubbele netwerkkaarten (NICs) [14](#page=14).
* RAID-configuraties voor opslag [14](#page=14).
* ECC RAM (Error-Correcting Code Memory) [14](#page=14).
**Voorbeelden van Rampenherstel (DR):**
* Data replicatie [14](#page=14).
> **Tip:** Zorg ervoor dat de RTO- en RPO-doelstellingen van uw organisatie realistisch zijn en aansluiten bij de bedrijfskritische behoeften en het beschikbare budget.
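Ter illustratie van het RPO-begrip een minimale schets (met fictieve doelstellingen en back-upintervallen) die controleert of een back-upschema binnen de afgesproken RPO blijft:

```python
def voldoet_aan_rpo(backup_interval_uur: float, rpo_uur: float) -> bool:
    """Het maximale dataverlies is hoogstens het interval tussen twee back-ups."""
    return backup_interval_uur <= rpo_uur

# Fictief voorbeeld: een RPO van 4 uur.
print(voldoet_aan_rpo(backup_interval_uur=1, rpo_uur=4))   # True: elk uur een back-up volstaat
print(voldoet_aan_rpo(backup_interval_uur=24, rpo_uur=4))  # False: een dagelijkse back-up is te weinig
```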
### 2.3 Datacenterbeveiliging
Beveiliging is een fundamenteel aspect van datacenterbeheer, met specifieke implementaties op verschillende niveaus. Dit omvat fysieke beveiliging, netwerkbeveiliging en data-integriteit [17](#page=17).
> **Tip:** Beveiliging is een doorlopend proces en vereist regelmatige updates, audits en training van personeel om effectief te blijven tegen nieuwe dreigingen.
### 2.4 Vergelijking TCO/ROI: Server/Datacenter versus Cloud
Het berekenen van de TCO en ROI verschilt significant tussen on-premise servers en datacenters enerzijds, en cloudoplossingen anderzijds. Cloudmodellen verschuiven vaak van CAPEX (kapitaaluitgaven) naar OPEX (operationele uitgaven), wat impact heeft op de financiële analyse [28](#page=28).
> **Tip:** Wanneer u cloudoplossingen evalueert, kijk dan verder dan de directe abonnementskosten en neem ook zaken als datatransferkosten, overhead en potentieel vendor lock-in mee in de TCO-berekening.
---
# Virtualisatie en software-defined technologieën
Dit onderwerp verkent de fundamentele concepten van virtualisatie op verschillende niveaus van de infrastructuur, inclusief processor-, geheugen- en opslagvirtualisatie, en beschrijft de evolutie naar software-defined datacenters en cloudoplossingen [19](#page=19) [20](#page=20).
### 3.1 Processor virtualisatie
Processorvirtualisatie omvat het creëren van een virtuele uitvoeringsomgeving voor processen en besturingssystemen. Dit stelt meerdere besturingssystemen of toepassingen in staat om op één fysieke machine te draaien, waarbij de onderliggende hardware wordt geabstraheerd [20](#page=20).
### 3.2 Memory virtualisatie
Memory virtualisatie is het proces waarbij fysiek geheugen wordt geabstraheerd en beheerd op een manier die de beschikbare geheugenbronnen efficiënter toewijst en deelt tussen verschillende processen of virtuele machines. Een **hypervisor** speelt hierbij een cruciale rol, omdat deze het fysieke geheugen beheert en toewijst aan verschillende gastbesturingssystemen of applicaties. Het besturingssysteem en de applicaties communiceren met de hypervisor om toegang te krijgen tot het virtuele geheugen [21](#page=21).
### 3.3 Storage virtualisatie
Storage virtualisatie biedt een geconsolideerd en geabstraheerd beeld van opslagbronnen, ongeacht hun fysieke locatie of het onderliggende opslagmedium. Dit maakt flexibeler beheer van opslagruimte mogelijk, zoals het aggregeren van meerdere opslagapparaten tot één logische pool. Diensten zoals OneDrive, Dropbox en Google Drive maken gebruik van opslagvirtualisatie om gebruikers toegang te geven tot hun bestanden, vaak via cloudgebaseerde oplossingen. Een **SAN (Storage Area Network)** is een voorbeeld van een infrastructuur die opslagvirtualisatie kan ondersteunen [22](#page=22).
### 3.4 Software-defined networking (SDN)
Software-defined networking (SDN) is een benadering om netwerkbeheer te ontkoppelen van de onderliggende hardware, door middel van software. Dit stelt beheerders in staat om netwerkfunctionaliteit op een flexibele en programmeerbare manier te beheren, wat essentieel is voor moderne datacenters [23](#page=23).
### 3.5 Evolutie naar Software-Defined Datacenters en Cloud
De overgang naar software-defined technologieën heeft geleid tot de ontwikkeling van software-defined datacenters en cloudoplossingen. Dit proces verloopt vaak in stappen [19](#page=19) [24](#page=24) [25](#page=25) [26](#page=26):
* **Hyperconvergente infrastructuur (HCI):** Combineert compute, storage en netwerkfunctionaliteit in één geïntegreerde oplossing, beheerd door software [24](#page=24).
* **Hybride Cloud:** Een architectuur die verschillende cloudomgevingen (privé en publiek) integreert, waardoor gegevens en applicaties tussen deze omgevingen kunnen worden gemigreerd [19](#page=19) [25](#page=25).
* **Multicloud:** Het gebruik van meerdere cloud computing-diensten van verschillende cloudproviders. Dit biedt flexibiliteit en voorkomt vendor lock-in [19](#page=19) [26](#page=26).
Het concept van software-defined is essentieel omdat het een grotere mate van flexibiliteit, schaalbaarheid en efficiëntie in datacenteroperaties mogelijk maakt [27](#page=27).
> **Tip:** Begrijpen hoe virtualisatie op de verschillende lagen (processor, geheugen, opslag) werkt, is cruciaal om de basis te leggen voor software-defined architecturen.
> **Tip:** De verschillende cloudmodellen (private, public, hybrid, multi) zijn directe toepassingen van virtualisatie en software-defined principes op een grotere schaal.
De afsluitende slide bevat belangrijke conceptuele vragen die gerelateerd zijn aan dit onderwerp, zoals het verschil tussen redundantie, disaster recovery en veerkracht, beveiligingsimplementaties, totale eigendomskosten (TCO) en return on investment (ROI) in verschillende infrastructuren, en de specifieke kenmerken van DAS, SAN en NAS, evenals hyperconverged, hybride en multicloud-oplossingen [28](#page=28).
---
# Opslagtechnologieën en interfaces
Dit gedeelte behandelt de verschillende opslagoplossingen zoals DAS, NAS en SAN, en de onderliggende diskinterfaces zoals SATA en SAS, die cruciaal zijn voor dataopslag en -toegang in computersystemen [47](#page=47).
### 4.1 Opslagoplossingen
Er zijn drie primaire architecturen voor het aansluiten van opslagapparaten: Direct Attached Storage (DAS), Network Attached Storage (NAS) en Storage Area Network (SAN). Deze verschillen voornamelijk in hoe de opslag is verbonden met de servers en clients, en hoe deze wordt beheerd en geschaald. Cloudopslagoplossingen zoals OneDrive, Dropbox en Google Drive passen niet direct in deze klassieke indeling, omdat ze opereren op netwerkniveau en vaak abstractie bieden van de onderliggende fysieke opslaginfrastructuur [47](#page=47) [48](#page=48) [49](#page=49).
#### 4.1.1 Direct Attached Storage (DAS)
DAS is de meest basale vorm van opslag, waarbij opslagapparaten direct zijn verbonden met een enkele server [47](#page=47).
* **Kenmerken:**
    * Eenvoudig in opzet en beheer [50](#page=50).
    * Hoge prestaties, omdat de verbinding direct is en niet via een netwerk loopt [50](#page=50).
    * Beperkte schaalbaarheid en flexibiliteit [50](#page=50).
    * Beperkte mogelijkheden voor back-up en disaster recovery direct via de DAS-structuur [50](#page=50).
* **Componenten:**
    * Kan bestaan uit interne schijven in de server of externe opslagbehuizingen (DAS enclosures) [51](#page=51).
    * RAID-controllers worden vaak gebruikt binnen DAS-opstellingen om redundantie en prestaties te verbeteren [51](#page=51).
    * JBODs (Just-a-Bunch-Of-Disks) zijn ook een vorm van DAS, waarbij schijven direct aan de controller worden gekoppeld zonder intelligente beheerfuncties [51](#page=51).
* **Connectiviteit:**
    * DAS-behuizingen kunnen 'daisy-chained' worden, wat betekent dat meerdere behuizingen achter elkaar worden geschakeld op dezelfde controller [52](#page=52).
#### 4.1.2 Network Attached Storage (NAS)
NAS-apparaten zijn gespecialiseerde servers die aan een netwerk zijn gekoppeld en bestandsgebaseerde opslagdiensten leveren aan clients [47](#page=47).
* **Kenmerken:**
    * Ideaal voor heterogene besturingssystemen (Windows, Unix/Linux) omdat ze netwerkprotocollen gebruiken [53](#page=53) [54](#page=54).
    * Eenvoudig in te stellen en uit te breiden met capaciteit [54](#page=54).
* **Nadelen:**
    * Schaalbaarheid van prestaties kan problematisch zijn, vaak gelimiteerd door de netwerkbandbreedte (LAN-prestaties); zie de rekenschets na deze lijst [54](#page=54).
    * Gevoelig voor 'reguliere' netwerkprestatieverminderingen, zoals die kunnen optreden bij het gebruik van Wi-Fi [54](#page=54).
* **Protocollen:**
    * Common Internet File System (CIFS), voornamelijk gebruikt door Windows [53](#page=53).
    * Network File System (NFS), voornamelijk gebruikt door Unix-achtige systemen [53](#page=53).
    * Deze protocollen maken het mogelijk om NAS-volumes 'te mounten' via het netwerk [53](#page=53).
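Een rekenschets bij de genoemde beperking: de maximale doorvoer van een NAS wordt in de praktijk begrensd door de netwerklink. De linksnelheden en de protocol-efficiëntie hieronder zijn gangbare aannames, geen cursusgegevens:

```python
def max_doorvoer_mb_s(link_gbit_s: float, efficientie: float = 0.8) -> float:
    """Ruwe bovengrens: linksnelheid in Gbit/s omgerekend naar MB/s,
    vermenigvuldigd met een aangenomen protocol-efficiëntie."""
    return link_gbit_s * 1000 / 8 * efficientie

print(round(max_doorvoer_mb_s(1)))   # ~100 MB/s over 1 Gbit/s Ethernet
print(round(max_doorvoer_mb_s(10)))  # ~1000 MB/s over 10 Gbit/s Ethernet
```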
#### 4.1.3 Storage Area Network (SAN)
Een SAN is een apart, high-speed netwerk dat exclusief is bedoeld voor opslagverkeer, waardoor servers toegang krijgen tot blok-gebaseerde opslag [47](#page=47) [55](#page=55).
* **Kenmerken:**
    * **Voordelen:**
        * Biedt hoge redundantie en disaster recovery mogelijkheden [57](#page=57) [60](#page=60).
        * Goede beheerbaarheid door centralisatie [60](#page=60).
        * Prestaties en capaciteit schalen zeer goed op, zonder afname van link-snelheid [60](#page=60).
        * Grote afstanden tussen servers en opslag zijn mogelijk [60](#page=60).
    * **Nadelen:**
        * LUN (Logical Unit Number) sharing vereist ondersteuning van het bestandssysteem [60](#page=60).
        * Duurder in aanschaf en implementatie dan DAS of NAS [60](#page=60).
        * Bedrijven die reeds in Fibre Channel hebben geïnvesteerd, kunnen terughoudend zijn met het overstappen naar iSCSI [60](#page=60).
* **Technologieën:**
    * **Fibre Channel (FC):** Een gevestigde SAN-oplossing die sinds 1994 bestaat. Vereist zowel specifieke hardware als software [58](#page=58).
    * **iSCSI (Internet Small Computer System Interface):** Werd populair rond 2006 en maakt gebruik van standaard Ethernet-hardware, hoewel specifieke software vereist is [58](#page=58).
* **iSCSI Terminologie:**
    * **iSCSI Target:** De 'server' die opslag (LUNs) aanbiedt binnen het SAN [59](#page=59).
    * **iSCSI Initiator:** De 'client' die verbinding maakt met de LUNs die door de targets worden aangeboden. Initiator software is vaak ingebouwd in besturingssystemen zoals Windows [59](#page=59).
> **Tip:** SAN is ideaal voor omgevingen die hoge prestaties, schaalbaarheid en beschikbaarheid vereisen, zoals databaseservers en virtualisatieclusters.
### 4.2 Diskinterfaces
Diskinterfaces bepalen hoe opslagapparaten fysiek worden aangesloten op het moederbord en hoe data wordt overgedragen. De belangrijkste interfaces worden hieronder besproken [61](#page=61).
#### 4.2.1 Parallel ATA (PATA)
PATA, voorheen bekend als IDE (Integrated Drive Electronics), was een veelgebruikte interface voor harde schijven en CD/DVD-spelers in werkstation- en desktopcomputers [63](#page=63).
* **Kenmerken:**
    * Elke IDE-connector ondersteunt maximaal twee apparaten [63](#page=63).
    * Moederborden konden 1, 2 of 4 IDE-connectoren hebben [63](#page=63).
    * Maximale kabellengte is beperkt tot 45 centimeter voor platte kabels [63](#page=63).
#### 4.2.2 SCSI (Small Computer System Interface)
SCSI is een oudere interface die zowel interne als externe aansluitingen kon bieden. Hoewel het vroeger veel gebruikt werd voor harde schijven, CD-spelers en scanners, wordt het tegenwoordig voornamelijk nog gebruikt voor tapestreamers [64](#page=64).
* **Kenmerken:**
    * Gebruikt parallelle SCSI-commando's, waarbij de SCSI-bus open blijft tijdens de communicatie [64](#page=64).
    * Er zijn veel verschillende connectortypen, wat compatibiliteitsproblemen kan veroorzaken [64](#page=64).
* **SCSI Adressering:**
    * Per SCSI-controller kunnen maximaal 15 apparaten worden aangesloten, elk met een unieke SCSI ID (0 tot 15) [65](#page=65).
    * ID nummer 7 is gereserveerd voor de controller zelf [65](#page=65).
    * Apparaten kunnen worden aangesproken met een combinatie van controller-ID en SCSI-ID, bijvoorbeeld SCSI 0:2 [65](#page=65).
#### 4.2.3 Serial ATA (SATA)
SATA is de opvolger van PATA en wordt veelvuldig gebruikt in desktopcomputers en laptops. Het maakt gebruik van seriële datatransmissie, wat resulteert in dunnere en flexibelere kabels, wat gunstig is voor airflow [62](#page=62) [66](#page=66).
* **Versies en Snelheden** (zie de rekenschets verderop):
    * SATA150 (1.5 Gbps) - ook wel SATA I genoemd [66](#page=66).
    * SATA300 (3 Gbps) - ook wel SATA II genoemd [66](#page=66).
    * SATA600 (6 Gbps) - ook wel SATA III genoemd [66](#page=66).
* **Voeding:**
    * SATA Power connector levert 12V, 5V en 3.3V spanning [66](#page=66).
* **Connectiviteit:**
    * SATA is een point-to-point systeem, geen bus-systeem, wat betekent dat er geen speciale adressering nodig is: één connector, één kabel, één apparaat [67](#page=67).
    * Dit vermindert het risico op prestatievermindering [67](#page=67).
    * Externe SATA-connectoren (eSATA) zijn mogelijk, maar vereisen externe voeding. Dit is opgelost met eSATAp, een combinatie van eSATA en USB [67](#page=67).
    * De praktische kabellengte voor SATA is ongeveer 50 centimeter [67](#page=67).
> **Belangrijke Opmerking:** Er kan een aanzienlijk verschil zijn tussen de mechanische SATA-connector en het daadwerkelijke protocol dat over de draad wordt gebruikt. Een SATA-poort kan bijvoorbeeld ook PCI Express-protocollen ondersteunen, vooral bij modernere interfaces zoals M.2 [68](#page=68).
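De genoemde SATA-snelheden zijn lijnsnelheden. SATA I/II/III gebruiken 8b/10b-lijncodering (niet in de cursustekst vermeld, maar algemeen bekend), waardoor de netto doorvoer lager uitkomt. Een rekenschets:

```python
def sata_netto_mb_s(lijnsnelheid_gbit_s: float) -> float:
    """8b/10b-codering: per 8 databits gaan er 10 bits over de lijn."""
    return lijnsnelheid_gbit_s * 1e9 * (8 / 10) / 8 / 1e6  # MB/s

for naam, gbit in [("SATA I", 1.5), ("SATA II", 3.0), ("SATA III", 6.0)]:
    print(f"{naam}: {sata_netto_mb_s(gbit):.0f} MB/s")  # 150, 300 en 600 MB/s
```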
#### 4.2.4 M.2
M.2 is een relatief nieuwe interface die zich tussen 'Drive interfaces' en 'Card Slot interfaces' bevindt. Het maakt intern gebruik van een PCI Express x4 interface, maar kan twee verschillende protocollen ondersteunen: SATA of NVMe [69](#page=69).
#### 4.2.5 Serial Attached SCSI (SAS)
SAS is de opvolger van parallel SCSI en gebruikt SCSI-protocollen over een seriële interface die lijkt op SATA [62](#page=62) [70](#page=70).
* **Voordelen ten opzichte van SCSI:**
    * SAS is point-to-point, in tegenstelling tot de bus-structuur van SCSI, wat prestatieverminderingen en 'skews' voorkomt [70](#page=70).
    * Ondersteunt meer apparaten per controller (tot 65.535, vergeleken met 16 voor SCSI) [70](#page=70).
    * Hogere snelheden [70](#page=70).
* **Compatibiliteit:**
    * SAS is onder bepaalde omstandigheden compatibel met SATA [70](#page=70).
* **Connectoren:**
    * Diverse connectortypen bestaan, zowel intern (zoals SFF-8482, SFF-8087) als extern (zoals SFF-8470, SFF-8088) [71](#page=71).
#### 4.2.6 U.2
U.2 (voorheen SFF-8639) is een interface die ook wordt gebruikt voor opslagapparaten, met name in enterprise-omgevingen, en kan verschillende protocollen ondersteunen, waaronder NVMe over PCI Express [62](#page=62).
### 4.3 Toekomstige Ontwikkelingen
* **NVM Express (NVMe):** Non-Volatile Memory Express wordt steeds belangrijker in servers, met name via PCI Express storage cards en in de toekomst mogelijk via geheugenkaarten. Hoewel M.2 en U.2 ook NVMe kunnen gebruiken, zijn dedicated PCI Express SSD's momenteel populairder in enterprise-opstellingen [73](#page=73).
* **Quantum Computers:** Quantum computing belooft een fundamentele verschuiving in opslagtechnologie, waarbij een bit veel meer informatie dan een enkel 0 of 1 zal kunnen opslaan, waardoor de traditionele scheiding tussen RAM en opslag mogelijk verdwijnt [6](#page=6).
---
## Veelgemaakte fouten om te vermijden
- Niet alle onderwerpen grondig bestuderen vóór het examen
- Formules en belangrijke definities over het hoofd zien
- De voorbeelden in de secties niet doorwerken
- Memoriseren zonder de onderliggende concepten te begrijpen
## Glossary
| Term | Definition |
|------|------------|
| CPU (Central Processing Unit) | De centrale verwerkingseenheid, het "brein" van een computer dat basisberekeningen uitvoert zoals laden, opslaan, vergelijken, optellen, aftrekken en verschuiven. |
| RAM (Random Access Memory) | Snel, vluchtig geheugen dat dicht bij de CPU staat en wordt gebruikt om data op te slaan tijdens de actieve werking. Vanwege de hoge kosten is de capaciteit beperkter dan bij opslag. |
| Opslag (Storage) | Persistent geheugen dat data opslaat, zelfs wanneer de stroom is uitgeschakeld. Dit is langzamer dan RAM maar biedt een veel hogere capaciteit tegen lagere kosten. |
| Cache | Een kleine, zeer snelle geheugenopslag die zich binnen de CPU of in de buurt van de opslag bevindt, gebruikt voor verdere optimalisatie van datatoegang. |
| Bus topologie | Een netwerkarchitectuur waarbij meerdere apparaten communiceren over een gemeenschappelijk medium, zoals bij SCSI of PATA. |
| Point-to-point communicatie | Een netwerkarchitectuur waarbij twee apparaten direct met elkaar communiceren via een privéverbinding, zoals bij SAS en SATA. |
| Ring topologie | Een netwerkarchitectuur waarbij elk apparaat deel uitmaakt van een lus en pakketten doorstuurt die niet voor hem bestemd zijn, zoals bij Fibre Channel. |
| Throughput | De hoeveelheid data die gedurende een bepaalde periode kan worden getransporteerd, een belangrijke maatstaf voor de snelheid van interconnecties en communicatie. |
| Latency | De tijd (vertraging) die nodig is om een enkele data-eenheid te transporteren, eveneens een cruciale factor voor prestaties bij data-overdracht. |
| RAS (Reliability-Availability-Serviceability) | Een reeks kenmerken die essentieel zijn voor servers en datacenters, gericht op het waarborgen van betrouwbaarheid, beschikbaarheid en onderhoudsgemak. |
| TCO (Total Cost of Ownership) | De totale kosten die gepaard gaan met het bezitten en exploiteren van een IT-systeem gedurende de levenscyclus, inclusief hardware, software, datacenter- en beheerkosten. |
| ROI (Return on Investment) | De winst die een investering genereert ten opzichte van de kosten ervan, een financiële maatstaf om de rentabiliteit van een project te beoordelen. |
| RTO (Recovery Time Objective) | De maximale toegestane tijd die een bedrijf mag nemen om na een storing weer operationeel te zijn. |
| RPO (Recovery Point Objective) | Het maximale verlies aan data dat acceptabel is na een storing, bepaald door het tijdstip tot wanneer data hersteld moet kunnen worden. |
| High Availability (HA) | Systemen die ontworpen zijn om continu te blijven draaien, zelfs bij falen van individuele componenten, om operationele continuïteit te garanderen. |
| Disaster Recovery (DR) | Een plan en set procedures om de IT-infrastructuur en bedrijfsprocessen te herstellen na een grote ramp of storing. |
| Datacenter | Een faciliteit die wordt gebruikt om grote hoeveelheden computersystemen en bijbehorende componenten, zoals servers, opslagsystemen en netwerkapparatuur, te huisvesten en te beheren. |
| Virtualisatie | Het proces van het creëren van een virtuele (software-gebaseerde) versie van iets, zoals een besturingssysteem, server, opslagapparaat of netwerkbronnen, in plaats van een fysieke versie. |
| Hypervisor | Software die het mogelijk maakt om meerdere besturingssystemen tegelijkertijd op één enkele fysieke machine te draaien. |
| Storage Virtualization | Het aggregeren van fysieke opslag uit meerdere netwerkopslagapparaten in wat lijkt op één enkel opslagapparaat dat wordt beheerd vanaf een centrale locatie. |
| Software-Defined Networking (SDN) | Een benadering van netwerkbeheer die programma-gestuurde netwerkconfiguratie, gedrag en beheer mogelijk maakt door de controle- en dataplane te scheiden. |
| Hyperconverged Infrastructure (HCI) | Een IT-infrastructuur die server-, opslag- en netwerkfunctionaliteit combineert in een software-gedefinieerde, geïntegreerde oplossing. |
| Hybrid Cloud | Een cloud computing-omgeving die een combinatie gebruikt van on-premises (private cloud) en publieke cloud services, met orkestratie tussen de twee. |
| Multi-Cloud | Het gebruik van cloud computing-diensten van meerdere cloud providers (bijvoorbeeld AWS, Azure, Google Cloud) om te profiteren van hun specifieke sterktes en om vendor lock-in te vermijden. |
| Rack Mount Server | Een server die is ontworpen om gemonteerd te worden in een standaard 19-inch rackkast, waarbij de hoogte wordt gemeten in 'U' (eenheden). |
| Blade Server | Een modulaire server die in een chassis wordt geplaatst en componenten zoals voeding, koeling en netwerkverbindingen deelt met andere blades in hetzelfde chassis. |
| IPMI (Intelligent Platform Management Interface) | Een standaard voor out-of-band beheer van servers, waarmee beheer op hardwareniveau kan worden uitgevoerd, ongeacht de status van het besturingssysteem. |
| iLO (Integrated Lights-Out) | Hewlett Packard Enterprise's specifieke implementatie van out-of-band serverbeheer, vergelijkbaar met IPMI. |
| iDRAC (Integrated Dell Remote Access Controller) | Dell's specifieke implementatie van out-of-band serverbeheer, vergelijkbaar met IPMI. |
| SAS (Serial Attached SCSI) | Een interfaceprotocol dat SCSI-commando's over een seriële verbinding transporteert, ontworpen voor hogere prestaties en meer apparaten per controller dan parallelle SCSI. |
| SATA (Serial Advanced Technology Attachment) | Een seriële interface die voornamelijk wordt gebruikt voor het verbinden van opslagapparaten zoals harde schijven en SSD's met het moederbord. |
| DAS (Direct Attached Storage) | Opslag die direct is aangesloten op een enkele computer of server, zonder een netwerkverbinding. |
| NAS (Network Attached Storage) | Een opslagapparaat dat via een netwerk (meestal Ethernet) toegankelijk is voor meerdere clients en bestandsgebaseerde toegang biedt. |
| SAN (Storage Area Network) | Een speciaal netwerk (vaak Fibre Channel of iSCSI) dat exclusief is toegewijd aan opslagapparaten, en block-level toegang biedt aan servers. |
| iSCSI (Internet Small Computer System Interface) | Een protocol dat SCSI-commando's over TCP/IP-netwerken transporteert, waardoor SAN-functionaliteit over standaard Ethernet kan worden gerealiseerd. |
| LUN (Logical Unit Number) | Een identificatienummer dat wordt gebruikt om een opslagapparaat of -volume binnen een SAN of bij iSCSI te identificeren. |
| NVRAM (Non-Volatile Random Access Memory) | Een type RAM dat gegevens behoudt, zelfs wanneer de stroom is uitgeschakeld, vaak gebruikt voor configuratiegegevens. |
| PCI Express (PCIe) | Een snelle seriële computeruitbreidingsbusstandaard die de PCI-bus vervangt en gebruikt wordt voor grafische kaarten, netwerkkaarten en andere uitbreidingskaarten. |
**Bron:** 1ZT-ICT-H3-P1.pptx
# Historische ontwikkeling van computers
Dit onderwerp traceert de evolutie van computerhardware, beginnend bij vroege rekenmachines en eindigend bij moderne apparaten zoals smartphones en het Internet of Things.
### 1.1 Vroege rekenmachines
#### 1.1.1 Rekenmachine van Von Leibniz
De rekenmachine van Von Leibniz, ontwikkeld rond 1672, was een vroege mechanische rekenmachine die in staat was tot optellen, aftrekken, vermenigvuldigen en delen. Dit markeerde een belangrijke stap in de automatisering van berekeningen.
#### 1.1.2 Analytische machine van Charles Babbage
Charles Babbage ontwierp rond 1834 de Analytische machine. Deze machine wordt beschouwd als de eerste general-purpose computer en bestond uit vier fundamentele onderdelen:
* **Geheugen (memory):** Voor het opslaan van gegevens en tussenresultaten.
* **Loopwerk (berekeningseenheid):** De eenheid die de berekeningen uitvoert.
* **Invoer (input):** Gebruikmakend van ponskaarten om instructies en data in te voeren.
* **Uitvoer (output):** Geponste kaarten of gedrukte resultaten als uitvoer.
### 1.2 De elektronische computer
#### 1.2.1 ENIAC (Electronic Numerical Integrator and Computer)
De ENIAC, voltooid in 1946, was een van de eerste elektronische digitale computers. Deze machine was enorm groot en zwaar, met een gewicht van 30 ton en een stroomverbruik van 140 kilowatt. De ENIAC maakte gebruik van vacuümbuizen (elektronenbuizen) in plaats van mechanische relais.
### 1.3 De transistor en VLSI
#### 1.3.1 Transistorcomputers
In de jaren 1960 kwamen de eerste computers met transistoren op de markt, zoals de PDP-1. Deze computers waren aanzienlijk kleiner en krachtiger dan hun voorgangers. De PDP-1 kon 4096 woorden aan geheugen opslaan en voerde ongeveer 200.000 berekeningen per seconde uit.
#### 1.3.2 VLSI (Very Large Scale Integration)
De ontwikkeling van VLSI-technologie in de jaren 1980 maakte het mogelijk om een zeer groot aantal transistoren op een kleine chip te integreren. Dit leidde tot aanzienlijk krachtigere en compactere computers.
### 1.4 Moderne computerontwikkelingen
#### 1.4.1 De Wet van Moore
De Wet van Moore, die observeert dat het aantal transistoren op een geïntegreerde schakeling ongeveer elke twee jaar verdubbelt, heeft de exponentiële groei van computerkracht en -capaciteit gedurende vele decennia aangedreven; op de tijdlijn van deze cursus wordt dit gemarkeerd rond het jaar 2000.
#### 1.4.2 Toekomst: Quantum Computing
Vooruitkijkend wordt quantum computing (rond 2040?) gezien als de volgende grote sprong in de computertechnologie, met het potentieel om problemen op te lossen die voor huidige computers onmogelijk zijn.
### 1.5 Evolutie van computers en apparaten
#### 1.5.1 Persoonlijke Computers en Laptops
* **1977:** De introductie van de personal computer (PC) maakte geavanceerde computerkracht toegankelijk voor individuele gebruikers.
* **1981:** De laptop bood mobiele computerfunctionaliteit, waardoor gebruikers konden werken en berekeningen konden uitvoeren zonder gebonden te zijn aan een vaste locatie.
#### 1.5.2 Mobiele en Draagbare Apparaten
* **1984:** De Personal Digital Assistant (PDA) was een vroege vorm van een draagbaar digitaal apparaat, ontworpen voor taken zoals agenda, contacten en notities.
* **1993:** Mobiele telefoons begonnen te ontstaan, aanvankelijk primair voor communicatie, maar legden de basis voor toekomstige slimme apparaten.
* **1994:** Spelconsoles ontwikkelden zich als gespecialiseerde apparaten voor entertainment, wat de groeiende diversificatie van hardwareproducten illustreert.
* **2007:** De smartphone integreerde de functionaliteit van een telefoon, computer en internettoegang in één draagbaar apparaat, wat een revolutie teweegbracht in mobiele technologie.
* **2015:** Het Internet of Things (IoT) begon aan populariteit te winnen, waarbij alledaagse objecten werden uitgerust met sensoren en connectiviteit, waardoor ze gegevens konden verzamelen en uitwisselen.
---
# Fundamentele onderdelen van de Analytical Engine
Dit onderwerp belicht de vier kerncomponenten van Charles Babbage's Analytical Engine, een vroege voorloper van de moderne computer.
### 2.1 De vier kerncomponenten
De Analytical Engine, ontworpen door Charles Babbage, wordt beschouwd als de eerste blauwdruk voor een algemeen toepasbare computer. Het apparaat was opgebouwd uit vier fundamentele onderdelen die essentieel waren voor zijn functionaliteit:
#### 2.1.1 Geheugen (memory)
Dit onderdeel diende als opslagplaats voor getallen en gegevens die de machine nodig had voor berekeningen. Het geheugen kon gegevens vasthouden die gedurende het rekenproces gebruikt moesten worden.
#### 2.1.2 Loopwerk (berekeningseenheid)
Het loopwerk, ook wel de berekeningseenheid genoemd, was verantwoordelijk voor het uitvoeren van de rekenkundige bewerkingen. Dit omvatte basisoperaties zoals optellen, aftrekken, vermenigvuldigen en delen. Dit was het "brein" van de machine waar de daadwerkelijke berekeningen plaatsvonden.
#### 2.1.3 Invoer (input)
De invoer van gegevens en instructies in de Analytical Engine werd verzorgd via ponskaarten. Deze methode, die ook in andere mechanische apparaten uit die tijd werd gebruikt, maakte het mogelijk om programma's en data op een gestructureerde manier aan de machine aan te bieden.
#### 2.1.4 Uitvoer (output)
De resultaten van de berekeningen konden op verschillende manieren worden gepresenteerd. De uitvoer van de Analytical Engine bestond uit geponste kaarten of een gedrukte uitvoer, waardoor de resultaten van de machine zichtbaar en bruikbaar werden gemaakt.
> **Tip:** Begrijpen hoe deze vier onderdelen samenwerkten, is cruciaal voor het bevatten van het baanbrekende ontwerp van Babbage. Het is een fundamenteel concept voor het begrijpen van de evolutie van computers.
---
# Vooruitgang in computertechnologie en miniaturisatie
Dit onderwerp behandelt de belangrijkste technologische sprongen die hebben geleid tot krachtigere en kleinere computers, zoals de introductie van transistoren, VLSI en de Wet van Moore.
### 3.1 Historische ontwikkeling van computers
De ontwikkeling van computers kent een rijke geschiedenis met significante mijlpalen die de weg hebben geplaveid voor de moderne technologie.
#### 3.1.1 Vroege rekenmachines
* **1672 - Von Leibniz Rekenmachine:** In staat tot optellen, aftrekken, vermenigvuldigen en delen.
* **1834 - Charles Babbage's Analytical Engine:** Wordt beschouwd als de eerste general-purpose computer met vier fundamentele onderdelen:
* Geheugen (memory)
* Loopwerk (berekeningseenheid)
* Invoer (via ponskaarten)
* Uitvoer (geponste of gedrukte uitvoer)
#### 3.1.2 De opkomst van elektronische computers
* **1946 - ENIAC (Electronic Numerical Integrator And Computer):** Een vroege elektronische computer die tijdens de Tweede Wereldoorlog werd ontwikkeld. Deze machine woog 30 ton en verbruikte 140 kilowatt, wat typisch was voor de eerste generatie elektronische computers op basis van vacuümbuizen.
#### 3.1.3 Transistorcomputers
* **1960 - Eerste computer met transistoren (bv. PDP-1):** De introductie van transistoren markeerde een cruciale stap in de richting van kleinere en efficiëntere computers. De PDP-1 had een geheugen van 4096 woorden en kon 200.000 berekeningen per seconde uitvoeren.
#### 3.1.4 VLSI (Very Large Scale Integration)
* **1980 - VLSI:** Deze technologie maakte het mogelijk om een zeer groot aantal transistoren op een kleine ruimte te integreren. Dit was een sleutelontwikkeling die de rekenkracht exponentieel deed toenemen en de grootte van computers drastisch verkleinde.
#### 3.1.5 Wet van Moore
* **2000 - Wet van Moore:** Hoewel niet strikt een technologische sprong, beschrijft de Wet van Moore de empirische observatie dat het aantal transistoren op een geïntegreerde schakeling ongeveer elke twee jaar verdubbelt. Dit heeft geleid tot een continue cyclus van verhoogde rekenkracht en verkleining van computerapparatuur.
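Een minimale rekenschets van deze verdubbeling; het startaantal transistoren en de beschouwde perioden zijn fictieve aannames:

```python
def transistors_na(jaren: float, start_aantal: float, verdubbelingstijd_jaren: float = 2) -> float:
    """Wet van Moore: het aantal transistoren verdubbelt ongeveer elke twee jaar."""
    return start_aantal * 2 ** (jaren / verdubbelingstijd_jaren)

# Fictief voorbeeld: start met 1 miljoen transistoren.
for jaren in (2, 10, 20):
    print(f"na {jaren:2d} jaar: {transistors_na(jaren, 1e6):,.0f}")
# na  2 jaar: 2,000,000 | na 10 jaar: 32,000,000 | na 20 jaar: 1,024,000,000
```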
#### 3.1.6 Toekomstige ontwikkelingen
* **2040? - Quantum Computing:** Dit vertegenwoordigt een potentiële toekomstige generatie computertechnologie die gebruikmaakt van kwantummechanische principes.
### 3.2 Evolutie van computerapparaten
De technologische vooruitgang heeft geleid tot een diversificatie en miniaturisatie van computerapparaten, waardoor computing steeds toegankelijker en draagbaarder is geworden.
* **1977 - PC (Personal Computer):** De introductie van de personal computer maakte computerkracht toegankelijk voor individuele gebruikers.
* **1981 - Laptop:** De ontwikkeling van laptops bood draagbare computeroplossingen.
* **1984 - PDA (Personal Digital Assistant):** Deze apparaten combineerden functionaliteit zoals agenda's, contactpersonen en notities in een draagbaar formaat.
* **1993 - Mobiele Telefoons:** De opkomst van mobiele telefoons begon de communicatie en computerfunctionaliteit te integreren in één apparaat.
* **1994 - Spelconsoles:** Specialisatie in entertainment met krachtige grafische en rekenmogelijkheden.
* **2007 - Smartphones:** Een significante evolutie die geavanceerde computerfunctionaliteit, internettoegang en communicatie in een handzaam apparaat combineerde.
* **2015 - IoT (Internet of Things):** De integratie van computersystemen in alledaagse objecten, waardoor deze gegevens kunnen verzamelen en uitwisselen.
> **Tip:** Begrip van de Wet van Moore is essentieel om de exponentiële groei in computerprestaties en de impact daarvan op miniaturisatie te kunnen plaatsen. De constante verdubbeling van transistoren op chips verklaart waarom apparaten steeds krachtiger en kleiner worden.
---
## Veelgemaakte fouten om te vermijden
- Niet alle onderwerpen grondig bestuderen vóór het examen
- Formules en belangrijke definities over het hoofd zien
- De voorbeelden in de secties niet doorwerken
- Memoriseren zonder de onderliggende concepten te begrijpen
## Glossary
| Term | Definition |
|------|------------|
| Rekenmachine | Een apparaat dat ontworpen is voor het uitvoeren van rekenkundige bewerkingen zoals optellen, aftrekken, vermenigvuldigen en delen. |
| Analytical Engine | Een mechanische universele computer ontworpen door Charles Babbage, die gezien wordt als de voorloper van moderne computers. |
| Geheugen (memory) | Een component van een computer die gebruikt wordt om gegevens en instructies op te slaan die de processor nodig heeft voor het uitvoeren van taken. |
| Loopwerk (berekeningseenheid) | Het deel van een computer dat verantwoordelijk is voor het uitvoeren van rekenkundige en logische bewerkingen. |
| Invoer (ponskaarten) | Een methode om gegevens en instructies in een computer in te voeren door middel van gatenpatronen op kaarten. |
| Uitvoer (geponste of gedrukte uitvoer) | De manier waarop een computer resultaten presenteert, hetzij in geponste vorm op kaarten, hetzij als gedrukte output. |
| ENIAC | De Electronic Numerical Integrator And Computer, een van de eerste elektronische digitale computers, gebouwd in de jaren 40. |
| Transistorcomputers | Computers die gebruik maken van transistoren als elektronische schakelaars, wat een aanzienlijke verbetering betekende ten opzichte van eerdere relaiscomputers. |
| VLSI (Very Large Scale Integration) | Een technologische vooruitgang waarbij een zeer groot aantal transistoren op één enkele chip geïntegreerd kan worden. |
| Wet van Moore | Een voorspelling die stelt dat het aantal transistoren op een microchip ongeveer elke twee jaar verdubbelt, wat resulteert in exponentiële groei van rekenkracht en afname van kosten. |
| Quantum Computing | Een nieuw type computer dat gebruik maakt van kwantummechanische principes zoals superpositie en verstrengeling om berekeningen uit te voeren. |
| PC (Persoonlijke Computer) | Een computer ontworpen voor individueel gebruik, die wijdverspreid raakte vanaf de late jaren 70. |
| Laptop | Een draagbare computer die alle componenten van een desktopcomputer bevat in een compacte vorm. |
| PDA (Personal Digital Assistant) | Een draagbaar elektronisch apparaat dat fungeert als persoonlijke organisator, met functies zoals een agenda, adresboek en notities. |
| Mobiele Telefoons | Draagbare telecommunicatieapparaten die oorspronkelijk bedoeld waren voor spraakcommunicatie, maar evolueerden naar multifunctionele apparaten. |
| Spelconsoles | Gespecialiseerde computersystemen die primair ontworpen zijn voor het spelen van videogames. |
| Smartphones | Geavanceerde mobiele telefoons met uitgebreide computerfuncties, waaronder internettoegang, applicaties en multimedia. |
| IoT (Internet of Things) | Een netwerk van fysieke objecten die zijn uitgerust met sensoren, software en andere technologieën om gegevens te verzamelen en uit te wisselen met andere apparaten en systemen via het internet. |
**Bron:** 1ZT-ICT-H3-P2.pptx
# Introductie tot computerhardware
Dit gedeelte biedt een fundamentele introductie tot de belangrijkste componenten van computerhardware, met een focus op het moederbord, de processor en het geheugen.
## 1. Introductie tot computerhardware
### 1.1 Moederbord
Het moederbord is het centrale zenuwstelsel van een computersysteem en fungeert als de ruggengraat die alle andere componenten met elkaar verbindt en communiceert.
#### 1.1.1 Keuze van een moederbord
De keuze voor een specifiek moederbord hangt af van diverse systeemvereisten, zoals de hoeveelheid benodigde RAM, de compatibiliteit met randapparatuur en het type processor dat gebruikt zal worden. Een voorbeeld van een moederbord is de Asus Pro B760M-C-CSM.
### 1.2 Processor
De processor, ook wel de centrale verwerkingseenheid (CPU) genoemd, is het hart van het computersysteem en verantwoordelijk voor het uitvoeren van instructies.
#### 1.2.1 Componenten van de processor
Een processor bestaat doorgaans uit de volgende kerncomponenten:
* **ALU (Arithmetic Logical Unit)**: De rekeneenheid die rekenkundige en logische bewerkingen uitvoert.
* **BE (Besturingseenheid)**: De besturingseenheid die instructies ophaalt, decodeert en de andere componenten aanstuurt.
* **REG (Registers)**: Kleine, snelle geheugens in de processor die tijdelijk data en instructies opslaan voor directe toegang.
#### 1.2.2 De Von Neumann Cyclus
De Von Neumann-cyclus beschrijft het fundamentele proces van hoe een processor instructies verwerkt:
1. **Instructie ophalen**: De besturingseenheid haalt de volgende instructie op uit het hoofdgeheugen.
2. **Instructie vertalen en operanden ophalen**: De instructie wordt vertaald en de benodigde data (operanden) worden opgehaald en in registers geplaatst.
3. **Bewerking uitvoeren**: De besturingseenheid geeft de ALU opdracht om de bewerking uit te voeren met de data in de registers.
4. **Resultaat opslaan**: Het resultaat van de bewerking wordt opgeslagen in een register, teruggeplaatst in het hoofdgeheugen, of gebruikt voor een volgende instructie.
#### 1.2.3 Kloksnelheid
De kloksnelheid van een processor, gemeten in Hertz (Hz), bepaalt hoe snel de processor instructies kan verwerken. Een hogere kloksnelheid betekent over het algemeen een snellere verwerking.
#### 1.2.4 Multi-CPU vs. Multi-Core
* **Multi-CPU**: Verwijst naar systemen met meerdere fysieke processors.
* **Multi-Core**: Verwijst naar systemen waarbij één fysieke processor meerdere rekenkernen bevat.
#### 1.2.5 Socket
De processor wordt in een specifieke socket op het moederbord geplaatst. De socket moet compatibel zijn met het type processor.
#### 1.2.6 Keuze van een processor
Bij de keuze van een processor zijn factoren zoals kloksnelheid, compatibiliteit met het moederbord (socket) en geheugenvereisten cruciaal.
### 1.3 Geheugen
Geheugen in een computer dient voor het opslaan van data in binaire vorm (0 of 1). Dit kan permanent of tijdelijk (vluchtig) zijn. Hieronder vallen programma's, documenten, bibliotheken en firmware.
#### 1.3.1 RAM (Random Access Memory)
RAM is een vluchtig type geheugen dat zowel gelezen als geschreven kan worden. De data gaat verloren zodra de stroom wordt uitgeschakeld.
##### 1.3.1.1 SRAM vs. DRAM
* **SRAM (Static RAM)**: Is zeer snel maar duurder. Het wordt vaak gebruikt voor cachegeheugen in processors.
* **DRAM (Dynamic RAM)**: Is trager dan SRAM maar goedkoper en wordt gebruikt als hoofdgeheugen.
##### 1.3.1.2 DDR RAM
DDR RAM (Double Data Rate Random Access Memory) is een veelvoorkomende technologie die gegevens op zowel de stijgende als dalende flank van de klokcyclus overdraagt, wat resulteert in een hogere doorvoersnelheid vergeleken met Single Data Rate (SDR) RAM. Hedendaagse systemen gebruiken typisch DDR4 en DDR5 RAM.
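Een rekenschets bij dit 'double data rate'-principe: de piekdoorvoer van een module volgt uit de geheugenklok, twee transfers per klokcyclus en een busbreedte van 64 bit. Het DDR4-3200-voorbeeld hieronder is een aanname ter illustratie:

```python
def ddr_piekdoorvoer_gb_s(geheugenklok_mhz: float, busbreedte_bits: int = 64) -> float:
    """DDR draagt data over op zowel de stijgende als de dalende flank: 2 transfers per klokcyclus."""
    transfers_per_s = geheugenklok_mhz * 1e6 * 2
    return transfers_per_s * (busbreedte_bits / 8) / 1e9  # GB/s

# Voorbeeld (aanname): DDR4-3200 heeft een geheugenklok van 1600 MHz.
print(f"{ddr_piekdoorvoer_gb_s(1600):.1f} GB/s")  # 25.6 GB/s
```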
##### 1.3.1.3 Vormen van DRAM
* **Slotted**: Wordt gebruikt in pc's en laptops.
    * **DIMM (Dual Inline Memory Module)**: Typisch voor pc's.
    * **SODIMM (Small Outline DIMM)**: Typisch voor laptops.
* **Embedded**: Geïntegreerd in kleinere apparaten en IoT-apparaten.
> **Tip:** Verschillende DDR-versies (bijvoorbeeld DDR4 en DDR5) zijn mechanisch en elektrisch niet compatibel met elkaar.
##### 1.3.1.4 Keuze van RAM
Bij het kiezen van RAM zijn de grootte (in gigabytes) belangrijk voor de toepassing, de fysieke vorm (afhankelijk van de beschikbare ruimte en het apparaat), en de DDR-versie en snelheid (afhankelijk van de processor en het moederbord).
#### 1.3.2 ROM (Read Only Memory)
ROM is een type geheugen dat permanent is (data blijft behouden zonder stroom) en alleen gelezen kan worden.
* **Toepassingen**: Firmware voor apparaten zoals wasmachines en ovens, en de BIOS (Basic Input/Output System) van computers.
* **Evolutie**: ROM evolueerde van PROM (Programmable ROM) naar EPROM (Erasable PROM) en vervolgens naar EEPROM (Electrically Erasable PROM), wat flexibiliteit in updates mogelijk maakte.
### 1.4 Computerbussen
Computerbussen zijn elektronische paden die de verschillende componenten binnen een computersysteem met elkaar verbinden en digitale data versturen. Er bestaan diverse bussen met verschillende snelheden, seriële of parallelle communicatie, en oudere of modernere standaarden.
#### 1.4.1 Chipset
De chipset functioneert als een "middle man" tussen de verschillende bussen en andere componenten. Deze is meestal geïntegreerd in het moederbord of de processor. Voorbeelden van verbindingen die via de chipset lopen zijn:
* **Front Side Bus (FSB)**: Tussen de processor en de Northbridge.
* **Memory Bus**: Tussen het geheugen en de Northbridge.
* **Randapparatuur bussen**: Zoals PCI en SATA, en USB, die via de Southbridge worden beheerd.
---
# Componenten van de processor
Dit gedeelte focust specifiek op de interne componenten van de processor, de centrale verwerkingseenheid, en gerelateerde concepten.
### 2.1 De centrale verwerkingseenheid (CPU)
De processor, ook wel de centrale verwerkingseenheid (CPU) genoemd, is het hart van een computersysteem. De belangrijkste functies worden uitgevoerd door de interne componenten van de processor.
#### 2.1.1 Kerncomponenten van de processor
De processor bevat doorgaans de volgende kerncomponenten:
* **Arithmetic Logical Unit (ALU):** Dit is de berekeningseenheid van de processor. De ALU is verantwoordelijk voor het uitvoeren van alle rekenkundige bewerkingen (zoals optellen, aftrekken, vermenigvuldigen, delen) en logische bewerkingen (zoals AND, OR, NOT).
* **Besturingseenheid (BE):** De besturingseenheid coördineert en stuurt alle activiteiten binnen de processor en de communicatie met andere componenten van het systeem. Het vertaalt instructies en bepaalt de volgorde waarin deze worden uitgevoerd.
* **Registers (REG):** Registers zijn kleine, zeer snelle geheugenelementen binnen de processor. Ze worden gebruikt om tijdelijk gegevens en instructies op te slaan die op dat moment actief worden verwerkt, wat de verwerkingssnelheid ten goede komt.
#### 2.1.2 De Von Neumann cyclus
De Von Neumann cyclus beschrijft het fundamentele proces waarmee een processor programma-instructies ophaalt en uitvoert. Dit is een iteratief proces dat bestaat uit de volgende stappen:
1. **Instructie ophalen (Fetch):** De besturingseenheid haalt de volgende instructie op uit het hoofdgeheugen (RAM).
2. **Instructie decoderen (Decode):** De besturingseenheid vertaalt de opgehaalde instructie om te bepalen welke bewerking moet worden uitgevoerd.
3. **Operanden ophalen (Fetch Operands):** De benodigde gegevens (operanden) voor de instructie worden opgehaald, vaak uit registers of het hoofdgeheugen.
4. **Uitvoeren (Execute):** De ALU voert de bewerking uit zoals gespecificeerd door de instructie, gebruikmakend van de opgehaalde operanden.
5. **Resultaat opslaan (Store):** Het resultaat van de uitgevoerde bewerking wordt opgeslagen in een register of teruggeschreven naar het hoofdgeheugen.
6. **Hergebruik/Volgende instructie:** Het resultaat kan worden hergebruikt voor een volgende instructie, of de cyclus begint opnieuw met het ophalen van de volgende instructie.
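Ter illustratie van deze cyclus een sterk vereenvoudigde simulatie in Python: een fictieve mini-processor met enkele registers en een paar zelfbedachte instructies doorloopt de stappen fetch, decode, execute en store.

```python
# Sterk vereenvoudigde Von Neumann-cyclus: fictieve instructieset, geen echte architectuur.
geheugen = {0: ("LOAD", "R0", 7),     # laad de waarde 7 in register R0
            1: ("LOAD", "R1", 5),     # laad de waarde 5 in register R1
            2: ("ADD",  "R0", "R1"),  # R0 <- R0 + R1 (de ALU-stap)
            3: ("HALT",)}
registers = {"R0": 0, "R1": 0}
pc = 0  # programmateller

while True:
    instructie = geheugen[pc]          # 1. instructie ophalen (fetch)
    opcode = instructie[0]             # 2. instructie decoderen (decode)
    if opcode == "HALT":
        break
    if opcode == "LOAD":               # 3. operanden ophalen en 4. uitvoeren
        registers[instructie[1]] = instructie[2]
    elif opcode == "ADD":
        registers[instructie[1]] += registers[instructie[2]]  # 5. resultaat opslaan in register
    pc += 1                            # 6. door naar de volgende instructie

print(registers)  # {'R0': 12, 'R1': 5}
```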
#### 2.1.3 Kloksnelheid
De kloksnelheid van een processor is een cruciale eigenschap die bepaalt hoe snel de processor instructies kan verwerken.
* **Definitie:** De kloksnelheid wordt uitgedrukt in Hertz (Hz) en geeft het aantal cycli aan dat de processor per seconde kan uitvoeren. Tegenwoordig wordt dit gemeten in Gigahertz (GHz).
* **Impact:** Een hogere kloksnelheid betekent over het algemeen dat de processor meer instructies per seconde kan verwerken, wat resulteert in een snellere totale systeemprestatie.
> **Tip:** Houd er rekening mee dat kloksnelheid niet het enige is dat de prestaties van een processor bepaalt. De architectuur, het aantal cores en de efficiëntie van de instructieset spelen ook een belangrijke rol.
#### 2.1.4 Multi-core processors
Moderne processors bevatten vaak meerdere verwerkingseenheden, bekend als "cores", op één enkele chip.
* **Multi-CPU vs. Multi-Core:** Vroeger werden meerdere losse CPU's gebruikt (Multi-CPU). Tegenwoordig worden meerdere cores geïntegreerd op één processorchip (Multi-Core).
* **Voordelen:** Multi-core processors maken het mogelijk om meerdere taken gelijktijdig uit te voeren (parallelle verwerking), wat de algehele prestaties aanzienlijk kan verbeteren, vooral bij multithreading-toepassingen.
#### 2.1.5 Processor socket
De processor socket is de fysieke connector op het moederbord waar de processor wordt geplaatst.
* **Compatibiliteit:** Het type processor socket is cruciaal, omdat het bepaalt welke processors compatibel zijn met een specifiek moederbord. Verschillende processorfabrikanten en -generaties gebruiken verschillende sockettypes.
#### 2.1.6 Embedded processors
Naast de processors in traditionele computers, zijn er ook veel "embedded" processors.
* **Toepassingen:** Deze processors zijn geïntegreerd in een breed scala aan apparaten, zoals huishoudelijke apparaten, IoT-apparaten, auto's en industriële systemen. Ze zijn vaak ontworpen voor specifieke taken en hebben een lager energieverbruik.
### 2.2 Keuze van de processor
De keuze van een processor hangt af van diverse factoren die gerelateerd zijn aan de beoogde toepassing en de specificaties van het systeem:
* **Prestatie-eisen:** Welke taken moet de computer uitvoeren? Zware taken zoals videobewerking of gaming vereisen krachtigere processors met hogere kloksnelheden en meer cores.
* **Moederbordcompatibiliteit:** De processor moet compatibel zijn met de socket en de chipset van het moederbord.
* **Geheugen (RAM) matching:** De processor en het moederbord moeten compatibel zijn met de specificaties van het RAM-geheugen, zoals de DDR-standaard en snelheid.
* **Energieverbruik:** Voor mobiele apparaten of energiezuinige systemen is een processor met een lager energieverbruik vaak een prioriteit.
---
# Memory types and their properties
This topic covers the different kinds of computer memory, their characteristics such as volatility, and their specific applications.
### 3.1 Computer memory: basic principles
Computer memory is essential for storing data in binary form, consisting of zeros ('0') and ones ('1'). Memory can be permanent or temporary (volatile) and is used to store programs, documents, libraries, firmware, and more.
### 3.2 RAM (Random Access Memory)
RAM is a type of memory that can be both read and written. It is volatile, meaning the data is lost as soon as the power supply is removed. There are two main types of RAM: DRAM and SRAM.
#### 3.2.1 DRAM (Dynamic Random Access Memory)
DRAM is the most common type of working memory.
* **SDRAM (Synchronous Dynamic Random Access Memory):** Synchronizes with the processor clock for more efficient data transfer.
* **DDR RAM (Double Data Rate Random Access Memory):** Doubles the data rate by transferring data on both the rising and the falling edge of the clock signal. The current standards are DDR4 and DDR5.
**Form factors for DRAM:**
* **Slotted:** Used in PCs and laptops, typically as DIMMs (Dual Inline Memory Module) for PCs and SODIMMs (Small Outline DIMM) for laptops.
* **Embedded:** Integrated into smaller devices and IoT devices.
**Compatibility:** Different DDR versions are not interchangeable.
**Selection criteria for DRAM:**
* **Size (in GB):** Important for the specific application.
* **Physical form:** Depends on the available space and the type of device.
* **DDR version and speed:** Must be compatible with the processor and the motherboard.
#### 3.2.2 SRAM (Static Random Access Memory)
SRAM is considerably faster than DRAM, but also more expensive. It is used in small quantities as cache memory inside the processor, to guarantee fast access to frequently used data.
### 3.3 ROM (Read Only Memory)
ROM is a type of memory from which data can only be read. It is non-volatile, meaning the data is retained even without a power supply.
**Evolution of ROM:**
* **ROM:** The original form, not programmable after production.
* **PROM (Programmable Read-Only Memory):** Programmable once after fabrication.
* **EPROM (Erasable Programmable Read-Only Memory):** Programmable and erasable with UV light.
* **EEPROM (Electrically Erasable Programmable Read-Only Memory):** Programmable and erasable with electrical pulses.
**Applications of ROM:**
* **Firmware:** Essential software embedded in devices such as washing machines and ovens.
* **BIOS (Basic Input/Output System):** A computer's start-up software, which initializes the hardware.
### 3.4 Computer bus
A computer bus is a communication system that connects the various components within a computer system and transfers digital data between them. Many different kinds of buses exist, varying in speed, type (serial/parallel), and age.
#### 3.4.1 Chipset
The chipset acts as an intermediary between the various buses and is usually integrated on the motherboard or in the processor. Examples of buses managed by the chipset are:
* Front Side Bus (FSB): connection between the CPU and the Northbridge.
* Memory Bus: connection between the memory and the Northbridge.
* Peripheral buses (PCI, SATA, USB): connections to the Southbridge.
---
# Communication between components
This part covers the methods of communication within a computer system, with a focus on computer buses and the role of the chipset as an intermediary.
### 4.1 Computer buses
A computer bus is the connection that brings the various components within a computer system into contact with each other. The basic task of a bus is to transfer digital data between these components. Many different kinds of buses exist, varying in speed (fast/slow), data transfer (serial/parallel), and age (new/old).
### 4.2 Chipset
The chipset acts as an intermediary between the various buses in a computer system. Nowadays the chipset is usually integrated into the motherboard or even into the processor itself.
Historically, the chipset could be divided into several bridges, each with a specific function:
* **Northbridge:** Connected the faster components of the system, such as the processor and the RAM. Communication between the processor and the memory ran over the so-called **Memory Bus**. The connection between the processor and the Northbridge used to be called the **Front Side Bus (FSB)**.
* **Southbridge:** Connected the slower components and peripherals, such as storage devices (via **SATA**), expansion cards (via **PCI**), and USB devices.
> **Tip:** Understanding the role of the chipset and the various buses helps in understanding the performance and expansion options of a computer system.
### 4.3 Communication between specific components
Several forms of communication take place within the system:
* **Processor <-> Northbridge (historical):** via the Front Side Bus (FSB).
* **Memory (RAM) <-> Northbridge (historical):** via the Memory Bus.
* **Peripherals (PCI, SATA, USB) <-> Southbridge (historical):** via specific interfaces.
Today these functions are often merged or integrated directly into the processor, leading to more efficient communication paths.
---
## Common mistakes to avoid
- Not studying all topics thoroughly before the exam
- Overlooking formulas and key definitions
- Skipping the examples in each section
- Memorizing without understanding the underlying concepts
Glossary
| Term | Definition |
|------|------------|
| Motherboard | A printed circuit board that connects all the components of a computer, such as the processor, memory, and expansion cards, enabling communication between them. |
| Processor (CPU) | The heart of the computer system, which executes instructions and performs calculations. It contains an arithmetic logic unit (ALU), a control unit (CU), and registers. |
| ALU (Arithmetic Logical Unit) | The arithmetic logic unit within the processor that performs mathematical and logical operations on data. |
| CU (Control Unit) | The control unit within the processor that fetches and decodes instructions and coordinates their execution, including the interaction with the ALU and memory. |
| Von Neumann Cycle | The operational cycle of a processor, consisting of fetching an instruction, decoding it, executing it, and storing the result. |
| Clock speed | The speed at which a processor can process instructions, measured in Hertz (Hz), indicating how many cycles per second the processor can complete. |
| RAM (Random Access Memory) | A type of working memory that allows fast reading and writing but is volatile, meaning the data is lost when the power is switched off. |
| DRAM (Dynamic Random Access Memory) | A widely used type of RAM that stores data in capacitors, which must be refreshed periodically to retain the data. |
| SDRAM (Synchronous Dynamic Random Access Memory) | A type of DRAM that operates synchronously with the system clock for improved performance. |
| DDR RAM (Double Data Rate Random Access Memory) | An improved version of SDRAM that can transfer data on both the rising and the falling edge of the clock pulse, doubling the data transfer rate. |
| ROM (Read Only Memory) | A type of non-volatile memory whose contents are permanent or can only be changed under special conditions. It is used for firmware and the BIOS. |
| Computer bus | A communication system that transfers data between the various components within a computer or between computers. Buses vary in speed, width, and functionality. |
| Chipset | A group of integrated circuits on the motherboard responsible for managing the data flow between the processor, memory, and peripherals. |
Cover
DDCArv_Ch4.pdf
Summary
# Introduction to hardware description languages
Hardware description languages (HDLs) are fundamental tools in digital design, enabling the specification, simulation, and synthesis of electronic circuits [4](#page=4).
### 1.1 The role of hardware description languages in digital design
A hardware description language (HDL) specifies only the logic function of a circuit. Computer-aided design (CAD) tools then use this specification to produce or synthesize the optimized gates that implement the desired function. The majority of commercial designs are built using HDLs [4](#page=4).
> **Tip:** When using an HDL, it's crucial to think about the actual hardware the code should produce and then write the HDL idiom that implies that hardware. Avoid treating HDLs like general-purpose software programming languages without considering the underlying hardware implications [6](#page=6).
### 1.2 Leading hardware description languages
Two prominent HDLs are SystemVerilog and VHDL [4](#page=4).
* **SystemVerilog:**
* Originated from Verilog, developed in 1984 by Gateway Design Automation [4](#page=4).
* Became an IEEE standard in 1995 [4](#page=4).
* Received extensions in 2005, leading to the IEEE STD 1800-2009 standard [4](#page=4).
* **VHDL:**
* Developed in 1981 by the Department of Defense [4](#page=4).
* Became an IEEE standard in 1987 [4](#page=4).
* Updated in 2008 to IEEE STD 1076-2008 [4](#page=4).
### 1.3 The process of simulation and synthesis
HDLs are integral to two key stages in the digital design workflow: simulation and synthesis [5](#page=5).
* **Simulation:**
* Involves applying inputs to the circuit described by the HDL and checking the outputs for correctness [5](#page=5).
* Debugging designs through simulation can save millions of dollars by identifying and fixing errors before actual hardware is produced [5](#page=5).
* **Synthesis:**
* This process transforms the HDL code into a netlist, which is a description of the hardware composed of gates and their interconnections [5](#page=5).
### 1.4 SystemVerilog modules
SystemVerilog utilizes modules to encapsulate hardware components. There are two primary types of modules [7](#page=7):
* **Behavioral modules:** These describe *what* a module does in terms of its functionality [7](#page=7).
* **Structural modules:** These describe *how* a module is constructed from simpler, interconnected modules [7](#page=7).
#### 1.4.1 Module declaration in SystemVerilog
A SystemVerilog module is declared using the `module` and `endmodule` keywords, which must enclose the module's definition. The name of the module follows the `module` keyword [8](#page=8).
**Basic structure:**
```systemverilog
module module_name(port_list);
// module body
endmodule
```
**Example:**
```systemverilog
module example(input logic a, b, c,
output logic y);
// module body goes here
endmodule
```
In this declaration:
* `module example`: Marks the beginning of a module named `example` [8](#page=8).
* `input logic a, b, c`: Declares three input signals of type `logic` [8](#page=8).
* `output logic y`: Declares one output signal of type `logic` [8](#page=8).
* `endmodule`: Marks the end of the module definition [8](#page=8).
#### 1.4.2 Behavioral SystemVerilog example
Behavioral descriptions often use constructs like the `assign` statement for combinational logic. This statement continuously assigns a value to a signal based on an expression [9](#page=9).
**Example:** Implementing a specific logic function:
```systemverilog
module example(input logic a, b, c,
output logic y);
assign y = ~a & ~b & ~c | a & ~b & ~c | a & ~b & c;
endmodule
```
In this example:
* `assign`: Indicates a continuous assignment [9](#page=9).
* `y =...`: The expression on the right side is continuously evaluated and assigned to the output `y` [9](#page=9).
* `~`: Bitwise NOT operator [9](#page=9).
* `&`: Bitwise AND operator [9](#page=9).
* `|`: Bitwise OR operator [9](#page=9).
This specific expression describes a circuit whose output `y` is true for the input combinations $abc = 000$, $100$, and $101$ (a sum-of-products form). The synthesis tool will convert this behavioral description into the corresponding gate-level logic [11](#page=11) [9](#page=9).
### 1.5 HDL simulation
Simulation is a critical step where the behavior of the HDL code is tested by applying input stimuli and observing the output responses. This allows designers to verify the functional correctness of the design before proceeding to hardware implementation [10](#page=10) [5](#page=5).
### 1.6 HDL synthesis
Synthesis is the process where the HDL code is translated into a netlist of standard logic gates. This netlist represents the actual hardware implementation. Tools analyze the HDL description, such as the continuous assignment shown in the behavioral example, and determine the minimum set of logic gates (AND, OR, NOT, etc.) required to implement that functionality [11](#page=11).
### 1.7 SystemVerilog syntax fundamentals
SystemVerilog has specific syntax rules that must be followed for correct interpretation by design tools [12](#page=12).
* **Case sensitivity:** SystemVerilog is case-sensitive, meaning `reset` and `Reset` are treated as different signals [12](#page=12).
* **Naming conventions:** Names for signals, modules, etc., cannot start with a number. For example, `2mux` is an invalid name [12](#page=12).
* **Whitespace:** Whitespace (spaces, tabs, newlines) is generally ignored by the compiler, aiding in code readability [12](#page=12).
* **Comments:** Comments are used to explain the code and are ignored during synthesis and simulation.
* Single-line comments start with `//` [12](#page=12).
* Multiline comments are enclosed between `/*` and `*/` [12](#page=12).
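As a quick illustration of these rules, a minimal (hypothetical) module might look like this, using case-sensitive names and both comment styles:
```systemverilog
/* multiline comments sit
   between the slash-star and star-slash markers */
module sillyfunction(input  logic a, b,   // 'a' and 'A' would be two different signals
                     output logic y);
  assign y = a & ~b;                      // single-line comment
endmodule
```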
---
# Combinational logic and operators in SystemVerilog
This section explores how to describe combinational logic in SystemVerilog, focusing on various operators, conditional assignments, internal variables, operator precedence, and numerical representation for bit manipulations.
### 2.1 Describing combinational logic
Combinational logic circuits are characterized by their outputs being solely dependent on their current inputs. SystemVerilog allows for the direct description of such logic using assignments.
#### 2.1.1 Bitwise operators
SystemVerilog supports standard bitwise operators for performing logical operations on individual bits or entire vectors. These operators include [15](#page=15):
* **AND (`&`):** Performs a logical AND operation bit by bit.
* **OR (`|`):** Performs a logical OR operation bit by bit.
* **XOR (`^`):** Performs a logical XOR operation bit by bit.
* **NAND (`~&`):** Performs a logical NAND operation bit by bit (NOT AND).
* **NOR (`~|`):** Performs a logical NOR operation bit by bit (NOT OR).
**Example:**
> **Example:** A module demonstrating different two-input logic gates operating on 4-bit buses.
> ```systemverilog
> module gates(input logic [3:0] a, b,
> output logic [3:0] y1, y2, y3, y4, y5);
> assign y1 = a & b; // AND
> assign y2 = a | b; // OR
> assign y3 = a ^ b; // XOR
> assign y4 = ~(a & b); // NAND
> assign y5 = ~(a | b); // NOR
> endmodule
> ```
> [15](#page=15).
#### 2.1.2 Reduction operators
Reduction operators operate on a vector and reduce it to a single bit. They are particularly useful for compact representation of logic that would otherwise require extensive explicit bitwise operations [16](#page=16).
* **Reduction AND (`&`):** Returns true if all bits in the vector are 1 [1](#page=1).
* **Reduction OR (`|`):** Returns true if any bit in the vector is 1 [1](#page=1).
* **Reduction XOR (`^`):** Returns true if an odd number of bits in the vector are 1 [1](#page=1).
* **Reduction NAND (`~&`):** The negation of the reduction AND.
* **Reduction NOR (`~|`):** The negation of the reduction OR.
* **Reduction XNOR (`~^`):** The negation of the reduction XOR.
**Example:**
> **Example:** A module that implements an 8-bit reduction AND operation.
> ```systemverilog
> module and8(input logic [7:0] a,
> output logic y);
> assign y = &a; // equivalent to a[7] & a[6] & ... & a[0]
> endmodule
> ```
> [16](#page=16).
#### 2.1.3 Conditional assignment (ternary operator)
The conditional assignment operator, also known as the ternary operator (`?:`), provides a concise way to assign a value based on a condition. It evaluates three operands: a condition, a value if the condition is true, and a value if the condition is false [17](#page=17).
The syntax is `assign y = condition ? value_if_true : value_if_false;`: if the condition is true, `y` receives the first value; otherwise it receives the second.
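A common use of the conditional operator is a 2:1 multiplexer. The sketch below is a minimal example (module and port names are assumptions, with 4-bit data inputs):
```systemverilog
module mux2(input  logic [3:0] d0, d1,
            input  logic s,
            output logic [3:0] y);
  assign y = s ? d1 : d0;  // if s is 1, y receives d1; otherwise y receives d0
endmodule
```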
Glossary
| Term | Definition |
|------|------------|
| Hardware Description Language (HDL) | A specialized programming language used to describe the structure, design, and behavior of electronic circuits, primarily for digital systems. HDLs allow designers to model hardware at various levels of abstraction. |
| Simulation | The process of modeling the behavior of a hardware design over time using test inputs to verify its functional correctness before physical implementation. This helps in identifying and debugging errors early in the design cycle. |
| Synthesis | The automated process of converting a high-level HDL description into a lower-level netlist of primitive logic gates and their interconnections, which can then be used for manufacturing. |
| SystemVerilog | An extension of the Verilog HDL, providing advanced features for describing, simulating, and verifying complex digital hardware designs. It is a popular standard in the semiconductor industry. |
| VHDL | (VHSIC Hardware Description Language) A standardized HDL developed for describing electronic circuits, particularly for use by the U.S. Department of Defense. It is known for its strong typing and verbose syntax. |
| Module | A fundamental building block in HDLs like SystemVerilog, representing a distinct hardware component with inputs and outputs. Modules can be behavioral, describing functionality, or structural, detailing internal construction. |
| Behavioral Description | A style of HDL modeling that focuses on the functional behavior of a circuit without specifying its precise gate-level implementation. It describes what a module does rather than how it is built. |
| Structural Description | A style of HDL modeling that describes a hardware component by explicitly instantiating and connecting simpler sub-modules or primitive gates, detailing its physical structure. |
| Combinational Logic | A type of digital logic where the output is solely determined by the current combination of its inputs, with no memory elements. Its output changes instantaneously with input changes. |
| Sequential Logic | A type of digital logic where the output depends not only on the current inputs but also on the past sequence of inputs, utilizing memory elements like flip-flops and latches. |
| Flip-Flop | A sequential logic element that stores one bit of information and changes its state only on the edge of a clock signal. It is a fundamental building block for memory and state machines. |
| Latch | A sequential logic element that stores one bit of information and holds its state as long as a control signal is active. Unlike flip-flops, latches are level-sensitive. |
| always_ff | A SystemVerilog construct used to describe sequential logic that infers flip-flops. It is sensitive to clock edges and is intended for describing clocked sequential elements. |
| always_comb | A SystemVerilog construct used to describe combinational logic. It automatically infers the sensitivity list to ensure that the logic is purely combinational, reacting to any changes in its inputs. |
| always_latch | A SystemVerilog construct used to describe latches. It infers a sensitivity list based on the signals that control the latching behavior. |
| Blocking Assignment | An assignment operator (`=`) in HDLs that executes sequentially. Once a blocking assignment is executed, the value is immediately updated, affecting subsequent statements within the same block. |
| Nonblocking Assignment | An assignment operator (`<=`) in HDLs that schedules an update to occur at the end of the current simulation time step. This prevents race conditions in sequential logic and allows for more predictable behavior. |
| Finite State Machine (FSM) | A computational model consisting of a finite number of states, transitions between those states, and actions performed. FSMs are widely used in digital design to control sequential behavior. |
| Moore FSM | A type of Finite State Machine where the output depends only on the current state. The output logic block determines the output based solely on the state register's current value. |
| Mealy FSM | A type of Finite State Machine where the output depends on both the current state and the current inputs. The output logic block considers both the state and input signals. |
| Parameterized Module | A module in HDL that includes parameters, allowing its behavior or structure to be customized during instantiation. This enables design reuse and flexibility for different word sizes or configurations. |
| Testbench | A separate HDL module designed to test and verify the functionality of another module, known as the "device under test" (DUT). Testbenches provide stimulus and check the DUT's responses. |
| Device Under Test (DUT) | The specific hardware module or circuit that is being tested by a testbench. Its inputs are driven by the testbench, and its outputs are monitored and verified. |
| Netlist | A description of a digital circuit in terms of its constituent logic gates and their interconnections. It is the output of the synthesis process and the input for physical design tools. |
| Sensitivity List | A list of signals in an `always` block that triggers the execution of the block's statements when any of the listed signals change. |
| Ternary Operator | A conditional operator (`?:`) in SystemVerilog that evaluates a condition and returns one of two values based on whether the condition is true or false. It takes three operands: condition, value if true, and value if false. |
| Reduction Operator | An operator in SystemVerilog that reduces a multi-bit operand to a single bit by applying a binary operation (like AND, OR, XOR) across all bits of the operand. |
| Bitwise Operator | Operators that perform logical operations on corresponding bits of two operands. Examples include AND (`&`), OR (`|`), and XOR (`^`). |
| Floating Output | An output state (`z`) in digital logic indicating that the output is neither a definite high (1) nor a low (0), often used in tri-state buffers to disconnect a driver. |
| Testvector | A set of input values and their corresponding expected output values used to verify the behavior of a hardware design. Testvectors are often stored in files for automated testing. |
Cover
DDCArv_Ch5.pdf
Summary
# Arithmetic circuits and adders
This section explores the fundamental arithmetic circuits that form the basis of digital computation, with a primary focus on adders and their various implementations.
## 1. Arithmetic circuits and adders
### 1.1 Introduction to adders
Adders are essential combinational logic circuits that perform the arithmetic addition of binary numbers. They are a cornerstone of digital design, enabling operations like counting, arithmetic logic unit (ALU) functionality, and data processing within computer architectures [5](#page=5).
### 1.2 Basic adder building blocks
#### 1.2.1 Half adder
A half adder is the simplest adder circuit that adds two single binary bits, A and B, producing a Sum (S) and a Carry Out (Cout) [5](#page=5).
* **Logic:**
* Sum: $S = A \oplus B$ [5](#page=5).
* Carry Out: $C_{out} = A \cdot B$ [5](#page=5).
#### 1.2.2 Full adder
A full adder is a circuit that adds three single binary bits: two input bits, A and B, and a Carry In ($C_{in}$) from a previous, less significant bit position. It outputs a Sum (S) and a Carry Out ($C_{out}$) [5](#page=5).
* **Logic:**
* Sum: $S = A \oplus B \oplus C_{in}$ [5](#page=5).
* Carry Out: $C_{out} = AB + AC_{in} + BC_{in}$ [5](#page=5).
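These equations translate directly into behavioral SystemVerilog; the following is a minimal sketch (module and signal names are assumptions):
```systemverilog
module fulladder(input  logic a, b, cin,
                 output logic s, cout);
  assign s    = a ^ b ^ cin;                     // S = A xor B xor Cin
  assign cout = (a & b) | (a & cin) | (b & cin); // Cout = AB + ACin + BCin
endmodule
```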
### 1.3 Multibit Adders
To add numbers larger than a single bit, multiple full adders are chained together. These are categorized as Carry Propagate Adders (CPAs). The efficiency of CPAs is largely determined by how the carry signal propagates through the stages [6](#page=6).
#### 1.3.1 Ripple-carry adder
A ripple-carry adder is constructed by connecting multiple 1-bit full adders in series. The carry-out of one stage becomes the carry-in of the next, rippling through the entire chain [8](#page=8).
* **Disadvantage:** This ripple effect makes ripple-carry adders inherently slow, especially for a large number of bits, as the carry must propagate through every stage before the final sum is determined [8](#page=8).
* **Delay:** The total delay of an N-bit ripple-carry adder is approximately $N$ times the delay of a single full adder ($t_{FA}$), i.e., $t_{ripple} = N \cdot t_{FA}$ [9](#page=9).
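As a sketch of the chaining, the structure can be expressed with a generate loop. This assumes the `fulladder` module from the previous sketch and a hypothetical width parameter `N`:
```systemverilog
module rippleadder #(parameter N = 8)
                    (input  logic [N-1:0] a, b,
                     input  logic cin,
                     output logic [N-1:0] s,
                     output logic cout);
  logic [N:0] c;       // internal carry chain
  assign c[0] = cin;
  genvar i;
  generate
    for (i = 0; i < N; i++) begin : stage
      // the carry-out of each stage feeds the carry-in of the next (the "ripple")
      fulladder fa(a[i], b[i], c[i], s[i], c[i+1]);
    end
  endgenerate
  assign cout = c[N];
endmodule
```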
#### 1.3.2 Carry-lookahead adder (CLA)
Carry-lookahead adders aim to overcome the speed limitations of ripple-carry adders by calculating carry signals in parallel, rather than waiting for them to propagate. This is achieved by using "generate" and "propagate" signals [10](#page=10) [11](#page=11).
* **Generate ($G_i$) and Propagate ($P_i$) signals for a column *i*:**
* **Generate ($G_i$):** A carry is generated at column *i* if both input bits $A_i$ and $B_i$ are 1.
$G_i = A_i \cdot B_i$ [11](#page=11).
* **Propagate ($P_i$):** A carry-in to column *i* will propagate to the carry-out if either $A_i$ or $B_i$ (or both) is 1.
$P_i = A_i + B_i$ [11](#page=11).
* **Carry Out ($C_i$) of a column:** This can be expressed as:
$C_i = G_i + P_i \cdot C_{i-1}$ [11](#page=11).
* **Block Propagate and Generate signals:** To speed up carry propagation across multiple bits, these column-level signals are used to compute block-level propagate and generate signals for groups of bits (e.g., k-bit blocks) [13](#page=13).
* **Block Propagate ($P_{k:0}$):** A carry-in to a k-bit block will propagate through all *k* bits to become the block's carry-out if each individual bit within the block propagates. For a 4-bit block (bits 3:0):
$P_{3:0} = P_3 \cdot P_2 \cdot P_1 \cdot P_0$ [14](#page=14) [15](#page=15).
* **Block Generate ($G_{k:0}$):** A carry is generated by a k-bit block if it's generated in any bit position within the block, or if it's generated in a lower position and propagated through all higher positions up to the block's carry-out. For a 4-bit block (bits 3:0):
$G_{3:0} = G_3 + G_2P_3 + G_1P_2P_3 + G_0P_1P_2P_3$ [15](#page=15).
This can be factored as:
$G_{3:0} = G_3 + P_3(G_2 + P_2(G_1 + P_1G_0))$ [15](#page=15) [17](#page=17) [18](#page=18).
* **Block Carry Out ($C_{k}$):** The carry-out of a k-bit block is determined by the block's generate and propagate signals:
$C_k = G_{k:0} + P_{k:0} \cdot C_{-1}$ (where $C_{-1}$ is the carry-in to the block) [17](#page=17).
* **Structure:** A 32-bit CLA can be implemented using 4-bit blocks, where each block has its own carry-lookahead logic, and a higher-level carry-lookahead logic handles carries between these blocks [19](#page=19) [25](#page=25).
* **Steps in CLA addition:**
1. Compute $G_i$ and $P_i$ for all columns [11](#page=11) [21](#page=21).
2. Compute $G$ and $P$ for k-bit blocks [22](#page=22).
3. The carry-in propagates through each k-bit block's generate/propagate logic, while sums are computed concurrently [23](#page=23).
4. Compute the sum for the most significant k-bit block [24](#page=24).
* **Delay:** The delay of an N-bit CLA with k-bit blocks is given by $t_{CLA} = t_{pg} + t_{pg\_block} + (N/k – 1)t_{AND\_OR} + k \cdot t_{FA}$. For large N (e.g., $N > 16$), CLAs are significantly faster than ripple-carry adders [26](#page=26).
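The column and block equations above can be written down almost verbatim. The fragment below is a sketch of the lookahead logic for one 4-bit block (names are assumptions):
```systemverilog
module cla_block4(input  logic [3:0] a, b,
                  input  logic cin,          // carry into the block (C-1)
                  output logic gblk, pblk,   // block generate / propagate
                  output logic cout);        // carry out of the block
  logic [3:0] g, p;
  assign g = a & b;        // Gi = Ai . Bi
  assign p = a | b;        // Pi = Ai + Bi
  assign pblk = &p;        // P3:0 = P3 P2 P1 P0
  assign gblk = g[3] | (p[3] & (g[2] | (p[2] & (g[1] | (p[1] & g[0])))));
  assign cout = gblk | (pblk & cin);   // Ck = G3:0 + P3:0 . C-1
endmodule
```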
#### 1.3.3 Prefix adders
Prefix adders, also known as carry-prefix adders, offer further speed improvements by computing all carry-prefix signals in parallel. They achieve this by pre-calculating carry-in signals for each column without waiting for a sequential ripple [27](#page=27) [28](#page=28).
* **Carry Calculation:** The core idea is to compute carry-in ($C_{i-1}$) for each column *i* efficiently. The sum $S_i$ is then computed as $S_i = (A_i \oplus B_i) \oplus C_{i-1}$ [28](#page=28).
* **Prefixes:** The circuit aims to quickly compute "prefixes" of the form $G_{k-1:-1}$ for various block sizes (1, 2, 4, 8 bits, etc.) until all carry-ins are known. These prefixes are essentially the carry-outs of the blocks [29](#page=29).
* **Generate ($G_{i:j}$) and Propagate ($P_{i:j}$) for a block spanning bits *i* to *j*:** These are defined recursively [30](#page=30).
* **Generate ($G_{i:j}$):** A carry is generated for the block $i:j$ if the upper part ($i:k$) generates a carry, or if the upper part propagates a carry generated by the lower part ($k-1:j$).
$G_{i:j} = G_{i:k} + P_{i:k} \cdot G_{k-1:j}$ [30](#page=30).
* **Propagate ($P_{i:j}$):** A carry propagates through the block $i:j$ if both the upper part ($i:k$) and the lower part ($k-1:j$) propagate the carry.
$P_{i:j} = P_{i:k} \cdot P_{k-1:j}$ [30](#page=30).
* **Carry-in to column *i* ($C_{i-1}$):** This is equivalent to the carry generated for the prefix spanning up to column $i-1$.
$C_{i-1} = G_{i-1:-1}$ [29](#page=29).
* **Delay:** The delay of a prefix adder for N bits is $t_{PA} = t_{pg} + \log_2 N (t_{pg\_prefix}) + t_{XOR}$ where $t_{pg}$ is the delay to produce initial $P_i, G_i$ and $t_{pg\_prefix}$ is the delay of a prefix cell. Prefix adders are generally the fastest among the discussed CPA types for large N [33](#page=33) [34](#page=34).
### 1.4 Adder Delay Comparisons
Comparing the delays of different 32-bit adder implementations using a 2-input gate delay of 100 ps and a full adder delay of 300 ps:
* **Ripple-carry adder:** $t_{ripple} = 32 \times 300 \text{ ps} = 9.6 \text{ ns}$ [34](#page=34).
* **Carry-lookahead adder (32-bit with 4-bit blocks):** $t_{CLA} = [100 + 600 + (32/4 - 1)\times 200 + 4\times 300]\text{ ps} = [100 + 600 + 1400 + 1200]\text{ ps} = 3.3\text{ ns}$ [34](#page=34).
* **Prefix adder:** $t_{PA} = [100 + \log_2(32)\times 200 + 100]\text{ ps} = [100 + 5\times 200 + 100]\text{ ps} = 1.2\text{ ns}$ [34](#page=34).
This comparison highlights that prefix adders offer the lowest delay, followed by carry-lookahead adders, with ripple-carry adders being the slowest due to the sequential carry propagation. The trade-off for increased speed in CLAs and prefix adders is typically higher hardware complexity [34](#page=34) [6](#page=6).
> **Tip:** When analyzing adder delays, remember that the complexity of the gates used for generate, propagate, and prefix computations directly impacts the overall circuit speed.
>
> **Example:** Consider a 32-bit addition. A ripple-carry adder would require 32 full adder delays sequentially. A carry-lookahead adder with 4-bit blocks would have a carry-lookahead logic for the blocks (which depends on the number of blocks), plus the delay within each block. A prefix adder would have a delay logarithmic to the number of bits, making it asymptotically faster.
---
# Subtractors, comparators, and the ALU
This section explores fundamental digital logic components: subtractors, comparators, and the Arithmetic Logic Unit (ALU), detailing their operations, control mechanisms, and status flag generation.
### 2.1 Subtractors
A subtractor performs the arithmetic operation of subtraction. It can be implemented using an adder by converting the subtraction of B into the addition of its two's complement, along with a carry-in of 1. This relationship is expressed as [36](#page=36):
$A - B = A + \overline{B} + 1$ [36](#page=36).
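A minimal SystemVerilog sketch of this idea (module name and width are assumptions):
```systemverilog
module subtractor #(parameter N = 8)
                   (input  logic [N-1:0] a, b,
                    output logic [N-1:0] y);
  assign y = a + ~b + 1'b1;  // A - B = A + not(B) + 1 (two's complement)
endmodule
```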
### 2.2 Comparators
Comparators are digital circuits that determine the relationship between two binary numbers.
#### 2.2.1 Equality comparator
An equality comparator checks if two numbers, A and B, are identical. It outputs a signal that is HIGH if A is equal to B, and LOW otherwise [1](#page=1) [37](#page=37).
#### 2.2.2 Signed less than comparator
A comparator can also determine if one signed number is less than another. A number A is considered less than B ($A < B$) if the result of subtracting B from A ($A - B$) is negative. However, care must be taken to account for potential overflow conditions when performing this subtraction, as overflow can lead to incorrect sign interpretation [38](#page=38).
### 2.3 The Arithmetic Logic Unit (ALU)
The ALU is a crucial combinational logic circuit within a processor that performs arithmetic and logic operations on binary operands [39](#page=39) [40](#page=40).
#### 2.3.1 Basic operations
A typical ALU is designed to perform fundamental operations such as:
* Addition [40](#page=40).
* Subtraction [40](#page=40).
* Logical AND [40](#page=40).
* Logical OR [40](#page=40).
#### 2.3.2 Control signals and operation selection
The specific operation performed by the ALU is determined by a set of control signals. For a 2-bit control signal (`ALUControl1:0`), different combinations dictate the operation:
* `00`: Add [41](#page=41) [42](#page=42) [43](#page=43).
* `01`: Subtract [41](#page=41) [42](#page=42) [43](#page=43).
* `10`: AND [41](#page=41) [42](#page=42).
* `11`: OR [41](#page=41) [42](#page=42).
The control signals manage multiplexers within the ALU to select the appropriate functional unit (e.g., adder, OR gate) and route its output to the ALU's result bus. For addition and subtraction, the `ALUControl0` signal often controls the carry-in to the adder [41](#page=41) [42](#page=42) [43](#page=43).
> **Tip:** Understanding how control signals direct data flow through functional units is key to comprehending ALU operation.
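A behavioral sketch of such an ALU using the control encoding above (flag outputs omitted; module and signal names are assumptions):
```systemverilog
module alu #(parameter N = 32)
            (input  logic [N-1:0] a, b,
             input  logic [1:0]   alucontrol,
             output logic [N-1:0] result);
  always_comb
    case (alucontrol)
      2'b00: result = a + b;   // add
      2'b01: result = a - b;   // subtract (A + ~B + 1)
      2'b10: result = a & b;   // AND
      2'b11: result = a | b;   // OR
    endcase
endmodule
```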
#### 2.3.3 Status flags
ALUs also generate status flags, which are single-bit outputs that provide information about the result of the most recent operation. These flags are essential for conditional branching and error detection [44](#page=44) [45](#page=45).
* **N (Negative flag):** Set to 1 if the most significant bit (MSB) of the result is 1, indicating a negative result in signed number representations [46](#page=46).
* **Z (Zero flag):** Set to 1 if all bits of the result are 0, indicating that the result of the operation is zero [47](#page=47).
* **C (Carry flag):** Set to 1 if the adder produces a carry-out. This flag is relevant for addition and subtraction operations (when `ALUControl` is `00` or `01`) [48](#page=48).
* **V (Overflow flag):** Set to 1 if an arithmetic operation results in an overflow. Overflow occurs when the addition of two numbers with the same sign produces a result with the opposite sign. The condition for overflow can be complex, involving the signs of the operands and the result, and is dependent on whether the ALU is performing addition or subtraction [49](#page=49) [50](#page=50).
#### 2.3.4 Comparisons based on flags
The status flags can be combined logically to perform comparisons between operands, distinguishing between signed and unsigned interpretations:
| Comparison | Signed | Unsigned |
| :---------- | :----------------------------------- | :------- |
| `==` | $Z$ | $Z$ |
| `!=` | $\sim Z$ | $\sim Z$ |
| `<` | $N \oplus V$ | $\sim C$ |
| `<=` | $Z \lor (N \oplus V)$ | $Z \lor \sim C$ |
| `>` | $\sim Z \land \sim(N \oplus V)$ | $\sim Z \land C$ |
| `>=` | $\sim(N \oplus V)$ | $C$ |
The general principle is to perform a subtraction ($A - B$) and then interpret the resulting flags to determine the comparison outcome [52](#page=52).
#### 2.3.5 Other ALU operations
Beyond basic arithmetic and logic, ALUs can support other operations:
* **Set Less Than (SLT):** This operation sets the least significant bit (LSB) of the result to 1 if $A < B$, and to 0 otherwise. It comes in both signed and unsigned versions. The implementation often involves an adder and logic to derive the SLT condition from the ALU's control signals and flags [53](#page=53) [54](#page=54) [55](#page=55).
* **XOR:** Performs the bitwise exclusive OR operation between operands A and B [53](#page=53).
---
# Shifters, multipliers, dividers, and number representations
This section explores fundamental digital operations, including bit shifting for multiplication and division, various multiplication and division algorithms, and different methods for representing numbers in binary, ranging from fixed-point to floating-point formats.
### 3.1 Shifters
Shifters are digital circuits that move the bits of a binary number. Different types of shifters exist, each with specific behaviors for filling vacated bit positions [57](#page=57).
#### 3.1.1 Logical Shifter
A logical shifter shifts bits to the left or right and fills any empty positions with zeros [57](#page=57).
* **Left Shift (<<):** Bits are moved to the left. The least significant bits (LSBs) are filled with zeros. For example, `11001 << 2` results in `00100` [57](#page=57).
* **Right Shift (>>):** Bits are moved to the right. The most significant bits (MSBs) are filled with zeros. For example, `11001 >> 2` results in `00110` [57](#page=57).
#### 3.1.2 Arithmetic Shifter
An arithmetic shifter behaves similarly to a logical shifter, but for right shifts, it fills the vacated MSB positions with the value of the original most significant bit (sign bit). This preserves the sign of a signed number [57](#page=57).
* **Arithmetic Right Shift (>>>):** For a number like `11001`, an arithmetic right shift by 2 (`11001 >>> 2`) results in `11110` [57](#page=57).
#### 3.1.3 Rotator
A rotator shifts bits in a circular manner. Bits shifted off one end of the number are reintroduced at the other end [57](#page=57).
* **Rotate Right (ROR):** Bits shifted off the right end reappear on the left. For example, `11001 ROR 2` results in `01110` [57](#page=57).
* **Rotate Left (ROL):** Bits shifted off the left end reappear on the right. For example, `11001 ROL 2` results in `00111` [57](#page=57).
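The logical and arithmetic shifts map onto SystemVerilog's shift operators, while rotation has no dedicated operator and is usually built from concatenation. A small sketch with an assumed 5-bit width and a shift amount of 2:
```systemverilog
module shifts(input  logic [4:0] a,
              output logic [4:0] lsl, lsr, asr, ror2);
  assign lsl  = a << 2;             // logical shift left: LSBs filled with zeros
  assign lsr  = a >> 2;             // logical shift right: MSBs filled with zeros
  assign asr  = $signed(a) >>> 2;   // arithmetic shift right: MSBs filled with the sign bit
  assign ror2 = {a[1:0], a[4:2]};   // rotate right by 2 via concatenation
endmodule
```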
#### 3.1.4 Shifters as Multipliers and Dividers
Shifters can efficiently perform multiplication and division by powers of two.
* **Multiplication by 2^N:** A left shift by `N` positions is equivalent to multiplying the number by $2^N$ [59](#page=59).
* Example: `00001 << 3` is equivalent to $1 \times 2^3 = 8$, resulting in `01000` [59](#page=59).
* Example with signed numbers: `11101 << 2` (representing -3 in two's complement) is equivalent to $-3 \times 2^2 = -12$, resulting in `10100` [59](#page=59).
* **Division by 2^N:** An arithmetic right shift by `N` positions is equivalent to dividing the number by $2^N$ [59](#page=59).
* Example: `01000 >>> 1` is equivalent to $8 \div 2^1 = 4$, resulting in `00100` [59](#page=59).
* Example with signed numbers: `10000 >>> 2` (representing -16) is equivalent to $-16 \div 2^2 = -4$, resulting in `11100` [59](#page=59).
> **Tip:** Using shifters for multiplication and division by powers of two is significantly faster than using general multiplication and division hardware.
### 3.2 Multipliers
Multiplication in binary involves forming partial products by multiplying each bit of the multiplier with the multiplicand, and then summing these shifted partial products [60](#page=60).
#### 3.2.1 Multiplication Process
The process can be visualized with a decimal example: $230 \times 42$. This breaks down into $(230 \times 2)$ and $(230 \times 40)$, which are then added. In binary, for a 4x4 multiplication:
* Each bit of the multiplier ($B_i$) is ANDed with each bit of the multiplicand ($A_j$) to form partial products $A_j B_i$ [61](#page=61).
* These partial products are then shifted appropriately and summed to produce the final product [61](#page=61).
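In SystemVerilog, building the partial-product array is typically left to the synthesis tool via the `*` operator. A minimal sketch (names are assumptions):
```systemverilog
module multiplier #(parameter N = 4)
                   (input  logic [N-1:0]   a, b,
                    output logic [2*N-1:0] y);
  assign y = a * b;   // a 4x4 multiply yields an 8-bit product
endmodule
```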
### 3.3 Dividers
Division in binary, similar to decimal long division, aims to find a quotient (Q) and a remainder (R) such that the dividend (A) can be expressed as $A = B \times Q + R$, where $0 \le R < B$. The general form is $A/B = Q + R/B$ [62](#page=62) [63](#page=63).
#### 3.3.1 Binary Long Division
The binary division process involves repeatedly subtracting the divisor (B) from a portion of the dividend.
* An iterative algorithm can be described as follows [64](#page=64):
* Initialize a remainder register $R'$ to 0.
* For each bit $A_i$ from the most significant to the least significant:
* Shift $R'$ left by one bit and append the current bit $A_i$ to form a new $R$: $R = \{R' \ll 1, A_i\}$ [64](#page=64).
* Subtract the divisor $B$ from $R$: $D = R - B$ [64](#page=64).
* If $D < 0$, the quotient bit $Q_i$ is 0, and $R'$ is set to $R$ (no subtraction occurred) [64](#page=64).
* If $D \ge 0$, the quotient bit $Q_i$ is 1, and $R'$ is set to $D$ (subtraction occurred) [64](#page=64).
* After iterating through all bits, the final $R'$ holds the remainder [64](#page=64).
> **Example:** Binary division of $1101$ by $10$ ($13 \div 2$).
> * $R' = 0$.
> * $i=3$ ($A_3=1$): $R = \{0 \ll 1, 1\} = 1$. $D = 1 - 10 = -1$. $Q_3=0$, $R'=1$.
> * $i=2$ ($A_2=1$): $R = \{1 \ll 1, 1\} = 11_2 = 3$. $D = 11_2 - 10_2 = 01_2 = 1$. $Q_2=1$, $R'=1$.
> * $i=1$ ($A_1=0$): $R = \{1 \ll 1, 0\} = 10_2 = 2$. $D = 10_2 - 10_2 = 0$. $Q_1=1$, $R'=0$.
> * $i=0$ ($A_0=1$): $R = \{0 \ll 1, 1\} = 1$. $D = 1 - 10_2 = -1$. $Q_0=0$, $R'=1$.
> * Final quotient is $0110$ and remainder is $1$ [64](#page=64) [6](#page=6).
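The iterative algorithm can be sketched as a SystemVerilog function for 4-bit operands (names are hypothetical, a non-zero divisor is assumed, and practical dividers are usually sequential rather than one large combinational block):
```systemverilog
// Returns the quotient of a / b and writes the remainder into r.
function automatic logic [3:0] divide4(input  logic [3:0] a, b,
                                       output logic [3:0] r);
  logic [3:0] q;
  logic [4:0] rem, d;           // one extra bit so the subtraction cannot wrap silently
  rem = '0;
  for (int i = 3; i >= 0; i--) begin
    rem = {rem[3:0], a[i]};     // R = {R' << 1, A_i}
    d   = rem - b;              // D = R - B
    if (d[4]) begin             // D < 0: quotient bit is 0, remainder unchanged
      q[i] = 1'b0;
    end else begin              // D >= 0: quotient bit is 1, remainder becomes D
      q[i] = 1'b1;
      rem  = d;
    end
  end
  r = rem[3:0];
  return q;
endfunction
```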
### 3.4 Fixed-Point Numbers
Fixed-point representation defines a specific number of bits for the integer part and a specific number of bits for the fractional part of a number, with the binary point's position fixed [69](#page=69) [70](#page=70).
#### 3.4.1 Number Systems
Binary representations can denote positive numbers (unsigned binary) or negative numbers, typically using two's complement or sign/magnitude formats [68](#page=68).
#### 3.4.2 Representation
* **Unsigned Fixed-Point Format (Ua.b):** Represents an unsigned number with `a` integer bits and `b` fractional bits. For example, U4.4 means 4 bits for the integer part and 4 bits for the fractional part [71](#page=71).
* Example: $6.75$ in U4.4 is `01101100` [70](#page=70) [71](#page=71).
* Common formats include 8-bit, 16-bit, and 32-bit fixed-point numbers, such as U8.8 for sensor data and U16.16 for higher precision signal processing [71](#page=71).
* **Signed Fixed-Point Format (Qa.b):** Represents a signed number (typically in two's complement) with `a` integer bits (including the sign bit) and `b` fractional bits [72](#page=72).
* To negate a Q fixed-point number, invert all bits and add one to the LSB [72](#page=72).
* Example: To write $-6.75$ in Q4.4:
* $6.75$ is `01101100`.
* Inverting bits: `10010011`.
* Adding 1 to LSB: `10010100` [72](#page=72).
* Q1.15 (also known as Q15) is a common format for signal processing, representing values in the range $[-1, 1)$ [72](#page=72).
#### 3.4.3 Saturating Arithmetic
Saturating arithmetic is an overflow handling technique where, instead of wrapping around, the result is clamped to the maximum or minimum representable value if an overflow occurs [73](#page=73).
* **Example:** In U4.4, adding $12$ (`11000000`) and $7.5$ (`01111000`) would normally overflow. With saturating arithmetic, the result is the maximum representable value, $15.9375$ (`11111111`) [73](#page=73).
> **Tip:** Saturating arithmetic is crucial in applications like signal processing and graphics to prevent artifacts caused by overflow, such as clicking sounds in audio or dark pixels in video [73](#page=73).
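A sketch of an unsigned saturating adder in this spirit (8-bit U4.4 operands assumed; names are hypothetical):
```systemverilog
module satadd_u8(input  logic [7:0] a, b,
                 output logic [7:0] y);
  logic [8:0] sum;
  assign sum = a + b;                      // 9 bits so the carry-out stays visible
  assign y   = sum[8] ? 8'hFF : sum[7:0];  // clamp to the maximum U4.4 value (15.9375) on overflow
endmodule
```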
### 3.5 Floating-Point Numbers
Floating-point numbers represent a wide range of values by using a variable position for the binary point, similar to scientific notation in decimal [69](#page=69) [75](#page=75).
#### 3.5.1 Representation
A floating-point number is generally expressed as $\pm M \times B^E$, where $M$ is the mantissa, $B$ is the base, and $E$ is the exponent [75](#page=75).
* In binary, the base $B$ is typically 2.
* The binary point floats to the right of the most significant '1' bit in the mantissa [69](#page=69).
#### 3.5.2 Floating vs. Fixed Point
* **Floating-point:** Offers a greater dynamic range (smallest to largest values) and is preferred for general-purpose computing where ease of programming is paramount. However, floating-point arithmetic is more complex and can be slower and more power-intensive [76](#page=76).
* **Fixed-point:** Has a smaller dynamic range but is simpler and more efficient for hardware implementations, making it ideal for performance-critical applications like signal processing, machine learning, and video processing [76](#page=76).
#### 3.5.3 IEEE 754 Standard
The IEEE 754 standard defines the representation for floating-point numbers, commonly in 32-bit (single-precision) and 64-bit (double-precision) formats. A 32-bit floating-point number is divided into three fields [77](#page=77) [83](#page=83):
* **Sign (1 bit):** 0 for positive, 1 for negative [77](#page=77).
* **Exponent (8 bits):** Represents the power of 2. It is stored in a "biased" form to allow for both positive and negative exponents and to simplify comparisons. The bias for single-precision is 127. The stored exponent is `bias + actual exponent` [77](#page=77) [80](#page=80) [83](#page=83).
* **Mantissa/Fraction (23 bits):** Represents the significant digits of the number. The leading '1' before the binary point is implicit and not stored, saving a bit and increasing precision [79](#page=79).
##### 3.5.3.1 Converting to IEEE 754
To represent a decimal number in IEEE 754 32-bit format:
1. Convert the magnitude of the decimal number to its binary equivalent [78](#page=78).
2. Express the binary number in scientific notation: $1.xxxxx \times 2^{\text{exponent}}$ [78](#page=78).
3. Fill the fields:
* **Sign bit:** Determined by the original number's sign [78](#page=78).
* **Exponent bits:** Calculate the biased exponent: `bias + actual exponent` [80](#page=80).
* **Fraction bits:** Store the bits of the mantissa that appear after the binary point [79](#page=79).
> **Example:** Representing $228_{10}$ in IEEE 754.
> 1. $228_{10} = 11100100_2$ [78](#page=78).
> 2. Binary scientific notation: $1.11001_2 \times 2^7$ [78](#page=78).
> 3. Fields:
> * Sign: 0 (positive) [78](#page=78).
> * Exponent: $127 (\text{bias}) + 7 = 134 = 10000110_2$ [80](#page=80).
> * Fraction: $11001000000000000000000$ (padded to 23 bits) [79](#page=79).
> The resulting representation is `0 10000110 11001000000000000000000` [80](#page=80).
##### 3.5.3.2 Special Cases
The IEEE 754 standard also defines special values:
* **Zero:** `X 00000000 00000000000000000000000` (the sign bit may be 0 or 1, giving $\pm 0$) [82](#page=82).
* **Infinity ($\infty$):** `0 11111111 00000000000000000000000` (positive infinity) or `1 11111111 00000000000000000000000` (negative infinity) [82](#page=82).
* **Not a Number (NaN):** `X 11111111` with a non-zero fraction [82](#page=82).
##### 3.5.3.3 Precision and Rounding
* **Single-Precision:** 32 bits (1 sign, 8 exponent, 23 fraction), bias = 127 [83](#page=83).
* **Double-Precision:** 64 bits (1 sign, 11 exponent, 52 fraction), bias = 1023 [83](#page=83).
* **Overflow:** Occurs when a number is too large to be represented [84](#page=84).
* **Underflow:** Occurs when a number is too small (close to zero) to be represented [84](#page=84).
* **Rounding Modes:** Used to adjust a number when it cannot be precisely represented after an operation. Common modes include rounding down, up, toward zero, and to the nearest value [84](#page=84).
> **Example of Rounding:** Rounding $1.100101_2$ (which is $1.578125_{10}$) to 3 fraction bits.
> * Down: $1.100_2 = 1.5_{10}$
> * Up: $1.101_2 = 1.625_{10}$
> * Toward zero: $1.100_2 = 1.5_{10}$
> * To nearest: $1.101_2 = 1.625_{10}$ (since $1.625$ is closer to $1.578125$ than $1.5$) [84](#page=84).
### 3.6 Floating-Point Addition
Adding floating-point numbers is a multi-step process that requires aligning the numbers before performing the addition [86](#page=86).
#### 3.6.1 Addition Algorithm
The general steps for floating-point addition are:
1. **Extract components:** Separate the sign, exponent, and fraction bits from both numbers [86](#page=86) [88](#page=88).
2. **Form mantissas:** Prepend the implicit leading '1' to the fraction bits to reconstruct the full mantissa [86](#page=86) [88](#page=88).
3. **Compare exponents:** Determine the difference between the exponents of the two numbers [86](#page=86) [89](#page=89).
4. **Align mantissas:** Shift the mantissa of the number with the smaller exponent to the right by the difference in exponents. This ensures both numbers have the same exponent before addition [86](#page=86) [89](#page=89).
5. **Add mantissas:** Perform binary addition on the aligned mantissas. The exponent remains the same as the larger of the two original exponents [86](#page=86) [89](#page=89).
6. **Normalize mantissa:** If the result of the addition has a mantissa that is not in the form $1.xxxxx$ (e.g., $10.xxxxx$ or $0.1xxxxx$), adjust it by shifting the binary point and updating the exponent accordingly [86](#page=86) [90](#page=90).
7. **Round result:** Apply a rounding mode if the normalized mantissa exceeds the available fraction bits [86](#page=86).
8. **Assemble:** Combine the sign, the adjusted exponent (biased), and the fraction bits back into the floating-point format [86](#page=86) [90](#page=90).
> **Example:** Adding $0\text{x}3\text{FC}00000$ and $0\text{x}40500000$.
> * N1: $0\ 01111111\ 10000000000000000000000$ (S=0, E=127, F=.1) $\implies$ Mantissa = $1.1_2$, Exponent = 0 [88](#page=88).
> * N2: $0\ 10000000\ 10100000000000000000000$ (S=0, E=128, F=.101) $\implies$ Mantissa = $1.101_2$, Exponent = 1 [88](#page=88).
> * Compare exponents: $127 < 128$. Exponent difference is 1.
> * Align N1: Shift $1.1_2$ right by 1 $\implies 0.11_2$ (now with exponent 1) [89](#page=89).
> * Add mantissas: $0.11_2 + 1.101_2 = 10.011_2$ [89](#page=89).
> * Normalize: $10.011_2 \times 2^1 = 1.0011_2 \times 2^2$ [90](#page=90).
> * Assemble: S=0, Exponent = $2 + 127 = 129 = 10000001_2$, Fraction = $001100...$. Result is $0\ 10000001\ 00110000000000000000000$, which is $0\text{x}40980000$ [90](#page=90).
---
# Counters, shift registers, and memory arrays
This section delves into essential sequential digital building blocks, covering the functionality, implementation, and applications of counters, shift registers, and various types of memory arrays [91](#page=91).
### 4.1 Counters
Counters are digital circuits that increment or decrement a binary number on each clock edge. They are fundamentally used to cycle through a sequence of numbers [92](#page=92).
#### 4.1.1 Counter implementation and applications
A basic counter increments its value on each active clock edge. If a reset signal is asserted, the counter typically returns to a predefined initial state, often zero [92](#page=92) [93](#page=93).
**Applications include:**
* Digital clock displays [92](#page=92).
* Program counters, which track the address of the next instruction to be executed in a processor [92](#page=92).
A SystemVerilog implementation of a counter with a synchronous reset demonstrates this behavior. The `always_ff` block describes a flip-flop that triggers on the positive edge of the clock. If the `reset` signal is high, the counter `q` is reset to 0; otherwise, it is incremented by 1 (`q <= q + 1;`). An alternative, more verbose implementation explicitly uses an adder to compute the next state before assigning it to the flip-flops [93](#page=93).
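A minimal sketch of that counter (synchronous reset; the width parameter is an assumption):
```systemverilog
module counter #(parameter N = 32)
                (input  logic clk, reset,
                 output logic [N-1:0] q);
  always_ff @(posedge clk)
    if (reset) q <= '0;      // synchronous reset to zero
    else       q <= q + 1;   // increment on every rising clock edge
endmodule
```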
> **Tip:** Synchronous resets are generally preferred in digital design as they ensure that the reset signal is synchronized with the clock, preventing timing issues.
#### 4.1.2 Divide-by-2^N counter
A divide-by-$2^N$ counter is a counter whose most significant bit (MSB) completes one full cycle every $2^N$ clock cycles (it toggles every $2^{N-1}$ cycles). This property makes such counters useful for slowing down a clock signal or creating periodic events, such as blinking an LED. For example, a 50 MHz clock divided by a 24-bit counter produces an MSB signal with a frequency of approximately 2.98 Hz [94](#page=94).
> **Example:** To blink an LED with a 50 MHz clock, one could use a 24-bit counter. The MSB of this counter completes one on/off cycle every $2^{24}$ clock cycles, so the LED blinks roughly three times per second.
#### 4.1.3 Digitally Controlled Oscillator (DCO)
A digitally controlled oscillator (DCO) is an extension of a counter that can generate an output frequency proportional to a digital input value. Instead of simply incrementing by 1 on each clock cycle, the counter increments by a value 'p'. The output frequency ($f_{out}$) is then given by the formula [95](#page=95):
$$f_{out} = f_{clk} \times \frac{p}{2^N}$$
where $f_{clk}$ is the input clock frequency, $p$ is the increment value, and $N$ is the number of bits in the counter [95](#page=95).
> **Example:** To generate a 200 Hz signal from a 50 MHz clock ($f_{clk}$), using a 24-bit counter ($N=24$):
> We need $\frac{p}{2^{24}} = \frac{200}{50 \text{ MHz}}$.
> If we choose $p = 67$, then $f_{out} = 50 \text{ MHz} \times \frac{67}{2^{24}} \approx 199.676 \text{ Hz}$.
> For a more precise frequency, a 32-bit counter ($N=32$) with $p = 17179$ can yield $f_{out} \approx 199.990 \text{ Hz}$.
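A sketch of such a DCO: the only change from a plain counter is that it advances by `p` each cycle and the MSB is taken as the output (parameter values and names are assumptions):
```systemverilog
module dco #(parameter N = 24)
            (input  logic clk, reset,
             input  logic [N-1:0] p,
             output logic out);
  logic [N-1:0] count;
  always_ff @(posedge clk)
    if (reset) count <= '0;
    else       count <= count + p;  // increment by p instead of 1
  assign out = count[N-1];          // the MSB provides an output at roughly f_clk * p / 2^N
endmodule
```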
### 4.2 Shift Registers
Shift registers are sequential circuits that store and shift binary data serially. They consist of a chain of flip-flops, where the output of one flip-flop is connected to the input of the next [96](#page=96).
#### 4.2.1 Serial-to-parallel conversion
On each clock edge, a new bit is shifted into the first flip-flop, and the data in all subsequent flip-flops is shifted one position down the chain. This process effectively converts a serial input data stream (`Sin`) into a parallel output word (`Q0` to `QN-1`) [96](#page=96).
#### 4.2.2 Shift register with parallel load
Shift registers can be augmented with a 'load' control signal [97](#page=97).
* When the `Load` signal is high (e.g., `Load = 1`), the register behaves like a standard parallel-load register, accepting data from parallel inputs (`D0` to `DN-1`) [97](#page=97).
* When `Load` is low (`Load = 0`), the register functions as a traditional shift register, shifting data from `Sin` to the output `Sout` [97](#page=97).
This dual functionality allows shift registers with parallel load to act as both serial-to-parallel converters (shifting `Sin` into `Q0:N-1`) and parallel-to-serial converters (shifting `D0:N-1` out through `Sout`) [97](#page=97).
The SystemVerilog implementation for a configurable shift register with parallel load and synchronous reset is provided. The `always_ff` block handles the logic, prioritizing reset, then load, and finally the serial shift operation. The serial output `sout` is assigned to the most significant bit of the register `q[N-1]` [98](#page=98).
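A sketch matching that description, with priority reset, then load, then shift (parameter and names are assumptions):
```systemverilog
module shiftreg #(parameter N = 8)
                 (input  logic clk, reset, load, sin,
                  input  logic [N-1:0] d,
                  output logic [N-1:0] q,
                  output logic sout);
  always_ff @(posedge clk)
    if      (reset) q <= '0;               // synchronous reset
    else if (load)  q <= d;                // parallel load from d
    else            q <= {q[N-2:0], sin};  // shift sin in at the LSB
  assign sout = q[N-1];                    // serial output is the MSB
endmodule
```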
### 4.3 Memory arrays
Memory arrays are fundamental structures designed for efficient storage of large amounts of digital data. They consist of a two-dimensional arrangement of bit cells, where each cell stores a single bit [100](#page=100).
#### 4.3.1 Structure and addressing
A memory array is characterized by its depth (number of rows, or words) and width (number of columns, or bits per word). For an N-bit address and M-bit data, there are $2^N$ unique addresses, defining $2^N$ rows (depth), and each row stores an M-bit data value (width) [100](#page=100) [101](#page=101). The total array size is depth $\times$ width, which equals $2^N \times M$ bits [100](#page=100).
> **Example:** A $2^2$ × 3-bit array signifies a memory with 4 words (since $2^2 = 4$), each 3 bits wide. The word stored at address `10` (binary for 2) would be `100`. A 1024-word by 32-bit array requires 10 address bits ($2^{10} = 1024$) and outputs 32 data bits.
#### 4.3.2 Wordlines and Bitlines
Access to individual memory cells is managed through wordlines and bitlines .
* **Wordline:** Similar to an enable signal, a wordline selects a specific row in the memory array. Only one wordline is active (high) at any given time, corresponding to the unique address being accessed. Reading or writing occurs on the row enabled by the wordline .
* **Bitline:** These lines carry the data to and from the memory cells. When a wordline is activated, the bitlines are used to either read the stored bit's value or to write a new value into the cell .
#### 4.3.3 Types of memory
Memory arrays are broadly categorized into two main types: Random Access Memory (RAM) and Read-Only Memory (ROM) .
##### 4.3.3.1 Random Access Memory (RAM)
RAM is volatile memory, meaning it loses its stored data when power is removed [106](#page=106) [107](#page=107). It offers fast read and write operations. The term "random access" signifies that any data word can be accessed with equal ease and speed, unlike sequential-access memories.
There are two primary types of RAM:
* **Dynamic Random Access Memory (DRAM):** Invented by Robert Dennard, DRAM stores each bit of data on a capacitor [110](#page=110) [111](#page=111) [112](#page=112). The term "dynamic" arises because the charge on the capacitor leaks over time, necessitating periodic refreshing (rewriting) to maintain the data. Reading a bit from DRAM also destroys its stored value, requiring a read-and-refresh cycle. A DRAM bit cell typically consists of a transistor and a capacitor.
* **Static Random Access Memory (SRAM):** SRAM stores data using cross-coupled inverters, which do not require refreshing as long as power is supplied [110](#page=110) [114](#page=114) [115](#page=115). This makes SRAM faster and simpler to access than DRAM, but it typically requires more transistors per bit, making it less dense and more expensive.
##### 4.3.3.2 Read-Only Memory (ROM)
ROM is nonvolatile memory, retaining its data even when power is switched off [106](#page=106) [108](#page=108). It is designed for quick reading, but writing data is either impossible or a slow process. Historically, ROMs were programmed at the time of fabrication, but modern ROMs like Flash memory allow for reprogramming. Flash memory, invented by Fujio Masuoka, is widely used in storage devices like thumb drives and solid-state drives.
**ROMs can be represented using dot notation:**
* A dot in a bit cell indicates that a '1' is stored .
* The absence of a dot indicates that a '0' is stored in that bit cell.
> **Example:** A $2^2$ × 3-bit ROM can be used to implement logic functions [118](#page=118) [121](#page=121). If the address lines are A1 and A0, and the outputs are Data2, Data1, and Data0, the functions can be mapped as follows:
> * Data2 = A1 $\oplus$ A0 [120](#page=120) [122](#page=122)
> * Data1 = A1 + A0 [120](#page=120) [122](#page=122)
> * Data0 = A1 $\cdot$ A0 [120](#page=120) [122](#page=122)
> The memory array effectively acts as a lookup table (LUT), where each input combination (address) maps to a pre-programmed output .
#### 4.3.4 Logic with memory arrays
Memory arrays, particularly ROMs, can be utilized to implement arbitrary logic functions [121](#page=121) [122](#page=122) [123](#page=123). By programming the bit cells, the memory array can store the truth table of a desired logic function. The address lines correspond to the logic function's inputs, and the data lines correspond to its outputs.
#### 4.3.5 SystemVerilog Memory Implementation
SystemVerilog provides constructs for modeling RAM and ROM:
* **RAM:** A RAM can be declared as an array of logic vectors, with read and write operations controlled by clock and write enable signals .
* **ROM:** A ROM is also declared as an array, but its contents are typically initialized from a file (using `$readmemh` in SystemVerilog) at simulation startup, and it only supports read operations [127](#page=127) [128](#page=128). Both styles are sketched below.
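A rough illustration of both styles (array sizes, port names, and the memory-file name are assumptions, not taken from the slides):

```systemverilog
// RAM: 64 words of 32 bits, synchronous write, asynchronous read.
module ram #(parameter N = 6, M = 32)
            (input  logic         clk, we,
             input  logic [N-1:0] adr,
             input  logic [M-1:0] din,
             output logic [M-1:0] dout);
  logic [M-1:0] mem [2**N-1:0];

  always_ff @(posedge clk)
    if (we) mem[adr] <= din;     // write only when the write enable is high

  assign dout = mem[adr];
endmodule

// ROM: contents loaded from a hex file at simulation start; read-only.
module rom #(parameter N = 6, M = 32)
            (input  logic [N-1:0] adr,
             output logic [M-1:0] dout);
  logic [M-1:0] mem [2**N-1:0];

  initial $readmemh("rom_contents.txt", mem);  // illustrative file name

  assign dout = mem[adr];
endmodule
```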
#### 4.3.6 Multi-ported memories
Multi-ported memories are memory structures that allow multiple read and/or write operations to occur concurrently. Each "port" provides an independent address/data interface. A common example is a register file, which often has multiple read ports and one or more write ports to support simultaneous access by different functional units within a processor [129](#page=129) [130](#page=130). A 32x32 register file with two read ports and one write port is illustrated in SystemVerilog.
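A sketch of such a register file (two combinational read ports, one synchronous write port; port names are illustrative):

```systemverilog
// 32 x 32-bit register file with two read ports and one write port.
module regfile(input  logic        clk, we3,
               input  logic [4:0]  a1, a2, a3,   // two read addresses, one write address
               input  logic [31:0] wd3,          // write data
               output logic [31:0] rd1, rd2);    // read data
  logic [31:0] rf [31:0];

  always_ff @(posedge clk)
    if (we3) rf[a3] <= wd3;      // at most one write per clock cycle

  assign rd1 = rf[a1];           // both reads are combinational
  assign rd2 = rf[a2];
endmodule
```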
---
# Logic arrays and FPGAs
This topic explores programmable logic devices, specifically Programmable Logic Arrays (PLAs) and Field-Programmable Gate Arrays (FPGAs), detailing their structures and functionalities for implementing digital circuits .
### 5.1 Programmable Logic Arrays (PLAs)
PLAs are a type of programmable logic device that implement combinational logic functions. Their internal structure consists of two programmable arrays: an AND array followed by an OR array .
#### 5.1.1 Structure of a PLA
A PLA takes inputs and generates implicants using an AND array, which are then combined in an OR array to produce the final outputs. The overall array structure is fixed; programming determines which input lines are connected to which AND gates and which AND-gate outputs are connected to which OR gates.
> **Tip:** The structure of a PLA is fundamentally a sum-of-products implementation where both the product terms (AND array) and the sum terms (OR array) are programmable .
**Example:**
Consider the functions:
* $X = \overline{A}\,\overline{B}\,\overline{C} + AB\overline{C}$
* $Y = A\overline{B}$
These can be implemented in a PLA where the AND array generates the product terms $\overline{A}\,\overline{B}\,\overline{C}$, $AB\overline{C}$, and $A\overline{B}$, and the OR array combines $\overline{A}\,\overline{B}\,\overline{C}$ and $AB\overline{C}$ to form $X$ and passes $A\overline{B}$ through to form $Y$. The dot notation in diagrams represents a programmable connection.
### 5.2 Field-Programmable Gate Arrays (FPGAs)
FPGAs are another type of programmable logic device that offer more flexibility than PLAs. They are characterized by an array of configurable Logic Elements (LEs) and programmable interconnections. FPGAs can implement both combinational and sequential logic .
#### 5.2.1 Components of an FPGA
A typical FPGA is composed of several key elements :
* **Logic Elements (LEs):** These are the fundamental units that perform logic operations .
* **Input/Output Elements (IOEs):** These interface the internal logic of the FPGA with the external world .
* **Programmable Interconnection:** This network of wires and switches allows LEs and IOEs to be connected flexibly .
* **Other Building Blocks:** Some FPGAs may also include specialized blocks like multipliers and RAMs .
#### 5.2.2 Logic Elements (LEs)
Each LE is designed to implement logic functions. A standard LE is composed of :
* **Lookup Tables (LUTs):** These are memory elements that can implement any combinational logic function of their inputs .
* **Flip-flops:** These provide sequential logic capabilities, allowing the LE to store state .
* **Multiplexers:** These are used to route signals and select between combinational and registered outputs, connecting LUTs and flip-flops .
**Example: Altera Cyclone IV LE**
An Altera Cyclone IV LE, for instance, contains one four-input LUT, one registered output, and one combinational output. This structure allows it to implement functions of up to four variables using the LUT and provides both a combinational and a sequential output .
#### 5.2.3 LE Configuration Examples
The configuration of LEs and the calculation of required LEs for specific functions are crucial for FPGA design.
**Example 1: XOR Chain**
To implement the function $Y = A_1 \oplus A_2 \oplus A_3 \oplus A_4 \oplus A_5 \oplus A_6$ using Altera Cyclone IV LEs:
* **Required LEs:** 2 .
* **Explanation:** The first LE can compute $Y_1 = A_1 \oplus A_2 \oplus A_3 \oplus A_4$, which is a function of 4 variables and fits within a single LE's LUT. The second LE can then compute $Y = Y_1 \oplus A_5 \oplus A_6$, a function of 3 variables, using the combinational output of the first LE and two additional inputs.
**Example 2: 32-bit 2:1 Multiplexer**
To implement a 32-bit 2:1 multiplexer using Altera Cyclone IV LEs:
* **Required LEs:** 32 .
* **Explanation:** A 1-bit multiplexer is a function of 3 variables (data input 1, data input 2, and select line), which can be implemented within a single LE. Therefore, a 32-bit multiplexer requires 32 such 1-bit multiplexer implementations, each requiring one LE .
**Example 3: Arbitrary Finite State Machine (FSM)**
To implement an arbitrary FSM with 2 bits of state, 2 inputs, and 3 outputs using Altera Cyclone IV LEs:
* **Required LEs:** 5 .
* **Explanation:**
* Each bit of state requires an LE to hold the current state bit and implement the next state logic. The next state logic is a function of the current state (2 bits) and the inputs (2 bits), totaling 4 variables, which fits in one LE. Thus, 2 LEs are needed for the two state bits .
* Each output bit requires an LE to compute its value. The output logic is typically a function of the current state (2 bits). Thus, 3 LEs are needed for the three output bits .
* Total LEs = 2 (for state) + 3 (for outputs) = 5 LEs .
### 5.3 FPGA Design Flow
The process of designing for an FPGA typically involves using specialized Computer-Aided Design (CAD) tools, such as Altera's Quartus II. The general flow includes :
1. **Design Entry:** Creating the circuit design using schematic entry or a Hardware Description Language (HDL) .
2. **Simulation:** Verifying the design's functionality through simulation .
3. **Synthesis and Mapping:** The CAD tool translates the design into a netlist and maps it onto the available LEs and interconnections of the target FPGA .
4. **Configuration Download:** The synthesized design is compiled into a configuration file, which is then downloaded onto the FPGA to program its logic and routing .
5. **Testing:** The implemented design on the FPGA is tested to ensure it functions correctly .
---
## Common mistakes to avoid
- Review all topics thoroughly before exams
- Pay attention to formulas and key definitions
- Practice with examples provided in each section
- Don't memorize without understanding the underlying concepts
Glossary
| Term | Definition |
|------|------------|
| Half Adder | A combinational logic circuit that performs the addition of two single binary digits, producing a sum bit and a carry-out bit. The sum bit is the XOR of the two inputs, and the carry-out is the AND of the two inputs. |
| Full Adder | A combinational logic circuit that adds three single binary digits: two input bits and a carry-in bit from a previous stage. It produces a sum bit and a carry-out bit. Its logic is typically defined as $S = A \oplus B \oplus C_{in}$ and $C_{out} = AB + AC_{in} + BC_{in}$. |
| Ripple-Carry Adder | A multibit adder where the carry-out of each stage is fed as the carry-in to the next stage. This creates a chain reaction, or ripple, of carry propagation from the least significant bit to the most significant bit, leading to a slow execution time that depends on the number of bits. |
| Carry-Lookahead Adder (CLA) | A type of multibit adder that significantly speeds up carry propagation. It achieves this by pre-calculating the carry-in for larger blocks of bits using generate ($G_i$) and propagate ($P_i$) signals, thereby reducing the delay associated with a ripple-carry adder. |
| Generate Signal ($G_i$) | In carry-lookahead addition, the generate signal for column $i$, defined as $G_i = A_i B_i$. If this signal is 1, a carry is generated in that column regardless of the carry-in. |
| Propagate Signal ($P_i$) | In carry-lookahead addition, the propagate signal for column $i$, defined as $P_i = A_i + B_i$. If this signal is 1, a carry-in to that column will propagate through to the carry-out. |
| Block Propagate Signal | In carry-lookahead adders using blocks of bits, this signal indicates whether a carry-in to the block will propagate through all bits of that block to become the block's carry-out. For a 4-bit block, it is calculated as $P_{3:0} = P_3P_2P_1P_0$. |
| Block Generate Signal | In carry-lookahead adders using blocks of bits, this signal indicates whether a carry will be generated within the block. For a 4-bit block, it is calculated as $G_{3:0} = G_3 + G_2P_3 + G_1P_2P_3 + G_0P_1P_2P_3$. |
| Prefix Adder | A highly parallel adder architecture that computes carry prefixes, which are signals indicating whether a carry will be generated or propagated through a range of bits. This architecture allows for very fast carry computation, with a delay proportional to the logarithm of the number of bits. |
| Subtracter | A digital circuit that performs subtraction. It can be implemented using an adder by taking the two's complement of the subtrahend and adding it to the minuend, i.e., $A - B = A + (\text{NOT } B) + 1$. |
| Comparator | A digital circuit that compares two binary numbers to determine their relationship (e.g., equal, less than, greater than). Equality comparators check if all corresponding bits are the same, while signed comparators often use subtraction and check the status flags of the ALU. |
| Arithmetic Logic Unit (ALU) | A fundamental digital circuit that performs arithmetic and logic operations on operands. It typically includes an adder, subtracter, logic gates (AND, OR, XOR), and multiplexers controlled by an operation code to select the desired function. |
| Status Flags | Bits that indicate the outcome of an ALU operation. Common flags include Negative (N, most significant bit of the result), Zero (Z, set if the result is all zeros), Carry (C, generated by the adder's carry-out), and Overflow (V, indicating an invalid result for signed arithmetic). |
| Shifter | A digital circuit that shifts the bits of a binary word to the left or right. Logical shifters fill empty positions with zeros, while arithmetic shifters fill empty positions with the sign bit during right shifts to preserve the number's sign. Rotators shift bits circularly. |
| Multiplier | A digital circuit that performs multiplication of two binary numbers. It typically involves generating partial products by multiplying the multiplicand with each bit of the multiplier and then summing these partial products. |
| Divider | A digital circuit that performs division of two binary numbers, producing a quotient and a remainder. The process often involves repeated subtraction and shifting operations, similar to long division. |
| Fixed-Point Number | A number representation where the binary point (radix point) has a fixed position. The number of bits allocated for the integer part and the fractional part is predefined, allowing for a specific range and precision. |
| Floating-Point Number | A number representation that allows for a wider dynamic range than fixed-point numbers. It consists of a sign bit, an exponent, and a mantissa (fraction), similar to scientific notation. The IEEE 754 standard is commonly used for its representation. |
| Mantissa | The fractional part of a floating-point number, typically normalized such that the most significant bit is always 1 (implicit leading 1). It determines the precision of the number. |
| Exponent | The part of a floating-point number that determines the magnitude or scale of the number. It is often stored in a biased form to represent both positive and negative exponents. |
| IEEE 754 Standard | A technical standard for floating-point arithmetic that defines formats for representing numbers (single-precision 32-bit, double-precision 64-bit) and specifying the behavior of arithmetic operations, including special values like infinity and NaN (Not a Number). |
| Saturating Arithmetic | A type of arithmetic that prevents overflow by clamping the result to the maximum or minimum representable value when an overflow occurs, rather than wrapping around. This is useful in applications like signal processing to avoid audible or visible artifacts. |
| Counter | A sequential logic circuit that generates a sequence of distinct values in response to clock pulses. Counters are used for timing, sequencing operations, and in applications like digital clocks and program counters. |
| Shift Register | A sequential logic circuit that shifts its stored bits one position to the left or right on each clock pulse. They are used for serial-to-parallel conversion, parallel-to-serial conversion, and creating time delays. |
| Memory Array | A two-dimensional arrangement of bit cells used to store digital data. It consists of rows (wordlines) and columns (bitlines), addressed by an address decoder to select specific memory words for reading or writing. |
| DRAM (Dynamic Random Access Memory) | A type of RAM that stores each bit of data in a separate capacitor within an integrated circuit. It requires periodic refreshing to maintain data due to charge leakage from the capacitors. |
| SRAM (Static Random Access Memory) | A type of RAM that uses flip-flops (typically cross-coupled inverters) to store each bit. It does not require refreshing as long as power is supplied and is generally faster but more expensive and less dense than DRAM. |
| ROM (Read-Only Memory) | A type of non-volatile memory that stores data permanently or semi-permanently. Data is read quickly, but writing is typically impossible or very slow. Flash memory is a modern type of ROM. |
| PLA (Programmable Logic Array) | A type of programmable logic device that consists of a programmable AND array followed by a programmable OR array. It is used to implement combinational logic functions. |
| FPGA (Field-Programmable Gate Array) | A type of programmable logic device that consists of an array of configurable logic elements (LEs), programmable interconnections, and input/output blocks. FPGAs can implement both combinational and sequential logic and are widely used for prototyping and custom hardware development. |
| Logic Element (LE) | The basic building block within an FPGA. An LE typically contains lookup tables (LUTs) for combinational logic and flip-flops for sequential logic, along with multiplexers for configuration. |
| LUT (Lookup Table) | A combinational logic circuit implemented as a small memory. A $k$-input LUT can implement any boolean function of $k$ variables by storing the function's truth table in its memory locations. |
Cover
DDCArv_Ch6.pdf
Summary
# Introduction to computer architecture and assembly language
This section introduces computer architecture as the programmer's perspective of a computer system, emphasizing its instruction set and operand locations, and contrasts it with microarchitecture, while also detailing assembly language and the RISC-V architecture's origins and design principles [3](#page=3).
### 1.1 Computer architecture: the programmer's view
Computer architecture is defined as the programmer's view of a computer. It is characterized by the set of instructions the computer understands and where it can find the data (operand locations) that these instructions operate on. This contrasts with microarchitecture, which focuses on the hardware implementation of that architecture [3](#page=3).
### 1.2 Assembly language and machine language
* **Instructions:** These are the fundamental commands that a computer executes [4](#page=4).
* **Assembly language:** This is a human-readable representation of machine instructions, making it easier for programmers to write and understand code [4](#page=4).
* **Machine language:** This is the computer-readable format of instructions, typically represented as sequences of ones and zeros [4](#page=4).
### 1.3 The RISC-V architecture
The RISC-V architecture was developed at the University of California, Berkeley, starting in 2010 by Krste Asanovic, David Patterson, and their colleagues. It is recognized as the first widely accepted open-source computer architecture [4](#page=4).
#### 1.3.1 Key figures in RISC-V development
* **Krste Asanovic:** A Professor of Computer Science at UC Berkeley, he developed RISC-V during a summer and is the Chairman of the Board of the RISC-V Foundation. He is also a co-founder of SiFive [5](#page=5).
* **Andrew Waterman:** A co-founder of SiFive, Waterman was instrumental in co-designing the RISC-V architecture and its first cores, driven by dissatisfaction with existing instruction set architectures (ISAs). He earned his PhD from UC Berkeley in 2016 [6](#page=6).
* **David Patterson:** A Professor of Computer Science at UC Berkeley since 1976, Patterson, along with John Hennessy, is credited with coinventing the Reduced Instruction Set Computer (RISC) architecture in the 1980s. He was a founding member of the RISC-V team and received the Turing Award with John Hennessy for their pioneering work in the quantitative design and evaluation of computer architectures [7](#page=7).
* **John Hennessy:** Served as President of Stanford University from 2000 to 2016 and has been a Professor of Electrical Engineering and Computer Science at Stanford since 1977. He also coinvented RISC with David Patterson and shared the Turing Award with him for their contributions to computer architecture [8](#page=8).
### 1.4 Architecture design principles
Hennessy and Patterson articulated several underlying design principles for computer architectures [9](#page=9):
1. **Simplicity favors regularity:** Consistent and predictable design choices simplify the architecture [9](#page=9).
2. **Make the common case fast:** Prioritize performance for the most frequently occurring operations [9](#page=9).
3. **Smaller is faster:** Generally, simpler and smaller designs lead to faster execution [9](#page=9).
4. **Good design demands good compromises:** Effective architectural design often involves balancing competing requirements and making trade-offs [9](#page=9).
> **Tip:** Understanding the distinction between architecture and microarchitecture is crucial. Architecture defines *what* the programmer sees and interacts with, while microarchitecture defines *how* that architecture is physically implemented in hardware. This study guide focuses on the former.
>
> **Tip:** Learning about RISC-V is valuable because its open-source nature makes it accessible and fosters a deeper understanding of computer instruction sets. Once you master one architecture, learning others becomes significantly easier [4](#page=4).
---
# RISC-V instruction set architecture: operands and instructions
This topic delves into the fundamental components of the RISC-V instruction set, covering its arithmetic and logical instructions, operand types, and memory access mechanisms, all while emphasizing core RISC design principles [10](#page=10) [11](#page=11) [16](#page=16).
### 2.1 Core instruction set principles
RISC-V adheres to design principles that prioritize simplicity and efficiency for common operations [15](#page=15).
#### 2.1.1 Simplicity favors regularity
A key design principle is that simplicity favors regularity. This means RISC-V uses a consistent instruction format, with most arithmetic and logical instructions having two source operands and one destination operand. This regularity makes instructions easier to encode and handle in hardware [13](#page=13).
#### 2.1.2 Make the common case fast
Another crucial principle is to make the common case fast. RISC-V achieves this by including a small set of simple, frequently used instructions. This allows for simple, small, and fast hardware for decoding and executing these instructions. Less common, more complex instructions are implemented by combining multiple simple RISC-V instructions. This contrasts with Complex Instruction Set Computers (CISC) like Intel's x86, which feature a larger, more complex set of instructions [15](#page=15).
#### 2.1.3 Smaller is faster
The principle of "smaller is faster" is also evident in RISC-V's design, particularly in its limited number of registers [19](#page=19).
### 2.2 Instruction types
RISC-V provides fundamental instructions for arithmetic, logical operations, and memory access.
#### 2.2.1 Arithmetic and logical instructions
Basic arithmetic operations are supported with straightforward mnemonics.
* **Addition:** The `add` instruction performs addition. It takes two source operands and writes the result to a destination operand [11](#page=11).
* C Code: `a = b + c;`
* RISC-V Assembly: `add a, b, c` [11](#page=11).
* **Subtraction:** The `sub` instruction performs subtraction, following the same operand structure as addition [12](#page=12).
* C Code: `a = b - c;`
* RISC-V Assembly: `sub a, b, c` [12](#page=12).
#### 2.2.2 Handling complex operations
More complex operations, which might be single instructions in CISC architectures, are decomposed into multiple RISC-V instructions [14](#page=14).
* **Example:** `a = b + c - d;`
* RISC-V Assembly:
```assembly
add t, b, c # t = b + c
sub a, t, d # a = t - d
```
### 2.3 Operands
Operands are the physical locations from which data is fetched or to which data is written. In RISC-V, these include registers, memory locations, and constants (immediates) [17](#page=17).
#### 2.3.1 Registers
RISC-V systems feature 32 32-bit registers, which are significantly faster than memory access. The architecture is referred to as "32-bit" because it primarily operates on 32-bit data [18](#page=18).
##### 2.3.1.1 RISC-V register set
The RISC-V register set has specific names and numbers assigned to each register, along with their designated usage [20](#page=20).
| Name | Register Number | Usage |
| :------ | :-------------- | :----------------------------------- |
| `zero` | `x0` | Constant value 0 |
| `ra` | `x1` | Return address |
| `sp` | `x2` | Stack pointer |
| `gp` | `x3` | Global pointer |
| `tp` | `x4` | Thread pointer |
| `t0`-`t2` | `x5`-`x7` | Temporaries |
| `s0`/`fp` | `x8` | Saved register / Frame pointer |
| `s1` | `x9` | Saved register |
| `a0`-`a1` | `x10`-`x11` | Function arguments / return values |
| `a2`-`a7` | `x12`-`x17` | Function arguments |
| `s2`-`s11`| `x18`-`x27` | Saved registers |
| `t3`-`t6` | `x28`-`x31` | Temporaries |
##### 2.3.1.2 Register usage conventions
Registers can be referred to by their names (e.g., `ra`, `zero`) or their numbers (e.g., `x1`, `x0`). Using names is generally preferred for readability. Specific registers have conventional uses [21](#page=21):
* `zero` always holds the constant value 0 [21](#page=21).
* Saved registers (`s0`-`s11`) are used to preserve variable values across function calls [21](#page=21).
* Temporary registers (`t0`-`t6`) are used for intermediate values during calculations [21](#page=21).
##### 2.3.1.3 Instructions involving registers
Instructions that operate on registers use the defined naming conventions [22](#page=22).
* **Example:** `a = b + c;` (where `a` maps to `s0`, `b` to `s1`, and `c` to `s2`)
* RISC-V Assembly: `add s0, s1, s2` [22](#page=22).
#### 2.3.2 Constants (Immediates)
Constants, also known as immediates, are literal values embedded directly within instructions.
* **Instructions with constants:** The `addi` instruction allows adding a constant to a register [23](#page=23).
* C Code: `a = b + 6;` (where `a` maps to `s0` and `b` to `s1`)
* RISC-V Assembly: `addi s0, s1, 6` [23](#page=23).
* **Generating 12-bit constants:** The `addi` instruction can handle 12-bit signed constants. Any constant requiring more than 12 bits cannot be generated this way [37](#page=37).
* **Example:** `int a = -372; int b = a + 6;` (where `a` maps to `s0` and `b` to `s1`)
* RISC-V Assembly:
```assembly
addi s0, zero, -372
addi s1, s0, 6
```
* **Generating 32-bit constants:** Larger constants require a combination of instructions. The `lui` (load upper immediate) instruction and `addi` are used together [38](#page=38).
* The `lui` instruction places an immediate value into the upper 20 bits of a destination register, zeroing out the lower 12 bits [38](#page=38).
* The `addi` instruction is then used to add the lower 12 bits of the constant. It is important to note that `addi` sign-extends its 12-bit immediate [38](#page=38).
* **Example:** `int a = 0xFEDC8765;` (where `a` maps to `s0`)
* RISC-V Assembly:
```assembly
lui s0, 0xFEDC8
addi s0, s0, 0x765
```
* **Handling constants with bit 11 set to 1:** If bit 11 of the 32-bit constant (the most significant bit of its lower 12 bits) is 1, the upper 20 bits loaded by `lui` must be incremented by 1 to correctly form the final constant, because `addi` sign-extends its 12-bit immediate [39](#page=39).
* **Example:** `int a = 0xFEDC8EAB;` (where `a` maps to `s0`)
* RISC-V Assembly:
```assembly
lui s0, 0xFEDC9 # s0 = 0xFEDC9000
addi s0, s0, -341 # s0 = 0xFEDC9000 + 0xFFFFFEAB = 0xFEDC8EAB
```
(Note: the 12-bit two's-complement encoding of -341 is `0xEAB`; `addi` sign-extends it to `0xFFFFFEAB`) [39](#page=39).
#### 2.3.3 Memory operands
When data exceeds the capacity of registers, it is stored in memory. While memory is large, it is slower than registers [25](#page=25).
##### 2.3.3.1 Memory addressing schemes
RISC-V memory can be conceptualized as either word-addressable or byte-addressable. RISC-V itself is byte-addressable [26](#page=26).
* **Word-addressable memory:** In this scheme, each 32-bit word has a unique address. A word in this context is 4 bytes wide [27](#page=27).
* **Load Word (`lw`):** To read a word from memory, the `lw` instruction is used. It specifies a destination register and an address calculated by adding an offset to a base register [28](#page=28).
* Format: `lw destination, offset(base)` [28](#page=28).
* Address Calculation: `address = base + offset` [28](#page=28).
* **Example:** Reading memory word 1 into `s3`. The address is calculated as `(zero + 1) = 1` [29](#page=29).
* RISC-V Assembly: `lw s3, 1(zero)` [29](#page=29).
* **Store Word (`sw`):** To write a word to memory, the `sw` instruction is used. It takes a source register (whose value is to be stored) and an address calculated similarly to `lw` [30](#page=30) [31](#page=31).
* Format: `sw source, offset(base)` [31](#page=31).
* **Example:** Storing the value in `t4` into memory word 3. The address is `(zero + 0x3) = 3` [31](#page=31).
* RISC-V Assembly: `sw t4, 0x3(zero)` [31](#page=31).
* The offset can be specified in decimal or hexadecimal [31](#page=31).
* **Byte-addressable memory:** In this scheme, each individual byte has a unique address. Since a 32-bit word consists of 4 bytes, word addresses increment by 4. RISC-V operates on this model [32](#page=32) [33](#page=33).
* **Address Calculation in Byte-Addressable Memory:** For `lw` and `sw` in a byte-addressable system, the memory word address must be multiplied by 4 to get the actual byte address. For instance, the address of memory word 2 is `2 * 4 = 8` [33](#page=33).
* **Example (Load Word):** Loading a word at memory address 8 into `s3` [34](#page=34).
* RISC-V Assembly: `lw s3, 8(zero)` [34](#page=34).
* **Example (Store Word):** Storing the value from `t7` into memory address `0x10` (which is word 4) [35](#page=35).
* RISC-V Assembly: `sw t7, 0x10(zero)` [35](#page=35).
* **Load Byte (`lb`) and Store Byte (`sb`):** RISC-V also supports instructions for loading and storing individual bytes from or to memory, such as `lb` and `sb` [32](#page=32).
---
# Control flow in RISC-V: branches, jumps, loops, and function calls
This section details how RISC-V processors manage program execution flow through branches, jumps, and function call mechanisms, and how these instructions facilitate the implementation of high-level control structures [41](#page=41) [51](#page=51).
### 3.1 Branching instructions
Branching allows the processor to execute instructions out of their sequential order. RISC-V supports two main types of branches: conditional and unconditional [52](#page=52).
#### 3.1.1 Conditional branching
Conditional branches execute a jump to a different instruction address only if a specified condition is met. Common conditional branch instructions include:
* `beq rs1, rs2, label`: Branch if equal [52](#page=52).
* `bne rs1, rs2, label`: Branch if not equal [52](#page=52).
* `blt rs1, rs2, label`: Branch if less than [52](#page=52).
* `bge rs1, rs2, label`: Branch if greater than or equal [52](#page=52).
These instructions compare the values in two source registers (`rs1` and `rs2`) and, if the condition is true, the program counter (PC) is updated to the address specified by `label`. If the condition is false, execution continues to the next instruction in sequence. Labels are symbolic names that represent instruction addresses and are followed by a colon (`:`) [53](#page=53) [54](#page=54).
> **Tip:** Assembly code often tests the opposite condition of the high-level language construct to simplify branching logic (e.g., testing `i != j` to implement an `if (i == j)` statement) [58](#page=58) [60](#page=60).
#### 3.1.2 Unconditional branching
Unconditional branches always redirect program execution to a specified address. The primary unconditional branch instruction is:
* `j label`: Jump [52](#page=52).
This instruction directly changes the PC to the address indicated by `label`, causing any instructions between the `j` instruction and the `label` to be skipped [55](#page=55).
### 3.2 Jumps and Jumps with Link
Jumps, particularly "jump and link" instructions, are crucial for implementing function calls and more complex control flow. RISC-V has two primary jump instructions:
* `jal rd, imm[20:0]`: Jump and link. This instruction saves the address of the next instruction (PC + 4) into a destination register (`rd`) and then updates the PC to the target given by the PC-relative immediate offset. This is fundamental for returning from functions.
* `jalr rd, rs, imm[11:0]`: Jump and link register. This instruction saves the address of the next instruction (PC + 4) into `rd`, and then updates the PC to the sum of the value in a source register (`rs`) and a sign-extended immediate offset.
### 3.3 Pseudoinstructions
Pseudoinstructions are convenient mnemonics that the assembler translates into actual RISC-V machine instructions. They simplify programming by providing higher-level abstractions .
#### 3.3.1 Jump pseudoinstructions
Several pseudoinstructions facilitate jumps:
* `j label`: Translates to `jal x0, label`. It performs an unconditional jump to `label` without saving the return address .
* `jal label`: Translates to `jal ra, label`. It jumps to `label` and saves the return address in the `ra` (return address) register .
* `jr rs`: Translates to `jalr x0, rs, 0`. It jumps to the address in register `rs` without saving the return address .
* `ret`: Translates to `jr ra` or `jalr x0, ra, 0`. This is the standard instruction for returning from a function .
Labels are used to specify jump targets, and the immediate offset is calculated as the number of bytes past the jump instruction .
#### 3.3.2 Long jumps
The immediate offsets in `jal` (20 bits) and `jalr` (12 bits) limit the range of jumps. For longer jumps, RISC-V uses the `auipc rd, imm` instruction, which adds an upper immediate to the PC (`rd = PC + {imm[31:12], 12'b0}`). The pseudoinstruction `call imm[31:0]` combines `auipc` and `jalr` to achieve a 32-bit offset jump.
### 3.4 Implementing Control Structures
Branch and jump instructions are the building blocks for high-level control structures like conditional statements and loops [57](#page=57).
#### 3.4.1 If statements
An `if` statement, such as `if (i == j) f = g + h;`, can be implemented by branching over the code block if the condition is *not* met. For instance, to execute `add s0, s1, s2` when `s3` (i) equals `s4` (j), one would use `bne s3, s4, L1` to skip the `add` if `i` is not equal to `j`. The code then continues after the `if` block [58](#page=58).
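Put into assembly, this might look like the following sketch (register mapping as described above):

```assembly
# if (i == j) f = g + h;    with f=s0, g=s1, h=s2, i=s3, j=s4
        bne s3, s4, L1      # skip the add if i != j
        add s0, s1, s2      # f = g + h
L1:                         # execution continues here
```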
#### 3.4.2 If/Else statements
`if/else` statements require an unconditional jump to skip the `else` block if the `if` condition is true. For `if (i == j) f = g + h; else f = f - i;` [59](#page=59):
1. Use `bne s3, s4, L1` to branch to `L1` if `i != j`.
2. If `i == j`, execute `add s0, s1, s2`.
3. Immediately after the `add`, use `j done` to skip the `else` block.
4. At `L1`, execute the `else` block: `sub s0, s0, s3`.
5. The code then proceeds from `done:`.
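Assembled, the five steps above might look like this (same register mapping):

```assembly
# if (i == j) f = g + h; else f = f - i;
        bne s3, s4, L1      # if i != j, go to the else block
        add s0, s1, s2      # f = g + h
        j   done            # skip the else block
L1:     sub s0, s0, s3      # f = f - i
done:
```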
#### 3.4.3 While loops
A `while` loop typically checks a condition at the beginning of each iteration. For `while (pow != 128)`, the loop structure is:
1. Check the loop condition using a branch (e.g., `beq s0, t0, done` if `s0` is `pow` and `t0` is 128). If true, exit the loop.
2. Execute the loop body (e.g., `slli s0, s0, 1` for `pow = pow * 2` and `addi s1, s1, 1` for `x = x + 1`).
3. Use an unconditional jump (`j while`) to return to the loop condition check.
4. The `done:` label marks the exit point of the loop [60](#page=60).
> **Tip:** Similar to `if` statements, `while` loops in assembly often test the inverse condition to jump out of the loop when the termination condition is met [60](#page=60).
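A sketch of this loop, assuming `pow` starts at 1 in `s0` and `x` starts at 0 in `s1`:

```assembly
# while (pow != 128) { pow = pow * 2; x = x + 1; }
        addi s0, zero, 1    # pow = 1 (assumed initial value)
        addi s1, zero, 0    # x = 0
        addi t0, zero, 128  # loop bound
while:  beq  s0, t0, done   # exit when pow == 128
        slli s0, s0, 1      # pow = pow * 2
        addi s1, s1, 1      # x = x + 1
        j    while          # re-check the condition
done:
```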
#### 3.4.4 For loops
A `for` loop has three parts: initialization, condition, and loop operation [61](#page=61).
* **Initialization:** Executes once before the loop begins.
* **Condition:** Tested at the start of each iteration.
* **Loop Operation:** Executes at the end of each iteration.
* **Statement:** The loop body.
A standard `for` loop like `for (i=0; i!=10; i = i+1) { sum = sum + i; }` can be implemented as:
1. **Initialization:** `addi s1, zero, 0` (sum=0), `addi s0, zero, 0` (i=0).
2. **Condition:** `beq s0, t0, done` (with 10 preloaded into `t0`; if i == 10, branch to `done`).
3. **Statement:** `add s1, s1, s0` (sum = sum + i).
4. **Loop Operation:** `addi s0, s0, 1` (i = i + 1).
5. **Jump back:** `j for` to re-evaluate the condition [62](#page=62).
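Putting the pieces together (with the bound 10 preloaded into `t0`):

```assembly
# for (i = 0; i != 10; i = i + 1) { sum = sum + i; }
        addi s1, zero, 0    # sum = 0
        addi s0, zero, 0    # i = 0
        addi t0, zero, 10   # loop bound
for:    beq  s0, t0, done   # exit when i == 10
        add  s1, s1, s0     # sum = sum + i
        addi s0, s0, 1      # i = i + 1
        j    for            # re-check the condition
done:
```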
For loops involving "less than" conditions, such as `for (i=1; i < 101; i = i*2)`, instructions like `bge` (branch if greater or equal) or `slt` (set if less than) can be used [63](#page=63) [64](#page=64).
* Using `bge`: `bge s0, t0, done` where `s0` is `i`, `t0` is 101. If `i` is greater than or equal to 101, exit the loop.
* Using `slt`: `slt t2, s0, t0` sets `t2` to 1 if `s0 < t0`. Then, `beq t2, zero, done` checks if `t2` is zero (meaning `s0` was not less than `t0`), exiting the loop if it is [64](#page=64).
#### 3.4.5 Arrays
Arrays provide access to large amounts of similar data using an index. Accessing an array element involves [66](#page=66):
1. Loading the base address of the array into a register [67](#page=67).
2. Calculating the byte offset for the desired element: `byte_offset = index * element_size`. For 4-byte words, this is `index * 4`, which can be efficiently done with a left shift: `slli t0, s1, 2` where `s1` is the index [70](#page=70).
3. Calculating the memory address: `address = base_address + byte_offset`.
4. Using `lw` (load word) or `sw` (store word) to access the element at that address [68](#page=68) [70](#page=70).
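A sketch of reading one element, assuming the array's base address is already in `s0` and the index is in `s1`:

```assembly
# t2 = array[i]   (4-byte elements; base address in s0, index i in s1)
        slli t0, s1, 2      # byte offset = i * 4
        add  t0, t0, s0     # address = base + byte offset
        lw   t2, 0(t0)      # load the element
```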
When accessing arrays of characters (strings), the null terminator character (`\0`) is often used to determine the end of the string. A `while` loop can iterate until it encounters this null character, incrementing a length counter [73](#page=73).
### 3.5 Function Calls
Function calls enable modularity and code reuse by allowing one part of the program (the caller) to execute another part (the callee) and then return [75](#page=75).
#### 3.5.1 Calling conventions
A function call follows a set of conventions:
* **Caller:** Passes arguments to the callee and then jumps to the callee's address. After the callee returns, the caller retrieves the return value and resumes execution.
* **Callee:** Performs the function's task, returns a result (if any), and returns control to the caller at the correct instruction. Crucially, the callee must not modify registers or memory that the caller relies on without saving and restoring them [77](#page=77).
#### 3.5.2 RISC-V function calling conventions
RISC-V specifies these conventions:
* **Function Call:** `jal func` (jump and link to the function named `func`) [78](#page=78).
* **Function Return:** `jr ra` (jump register, returning to the address stored in `ra`) [78](#page=78).
* **Arguments:** Passed in registers `a0` through `a7` [78](#page=78).
* **Return Value:** Placed in register `a0` [78](#page=78).
For functions with more than eight arguments, subsequent arguments are typically passed on the stack [79](#page=79) [80](#page=80).
> **Example:** To call a function `diffofsums` with arguments 2, 3, 4, and 5, the caller would load these values into `a0`, `a1`, `a2`, and `a3` respectively, then execute `jal diffofsums`. The result returned in `a0` would then be copied into a callee-saved (preserved) register such as `s7` [80](#page=80).
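The caller side of this example could be sketched as follows (copying the result with `add` is one of several equivalent choices):

```assembly
# result = diffofsums(2, 3, 4, 5);   result kept in s7
        addi a0, zero, 2    # argument 0
        addi a1, zero, 3    # argument 1
        addi a2, zero, 4    # argument 2
        addi a3, zero, 5    # argument 3
        jal  diffofsums     # call; return address saved in ra
        add  s7, a0, zero   # copy the return value into s7
```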
#### 3.5.3 Register usage and the stack
The callee must preserve any registers it modifies that the caller might need later. These are categorized into:
* **Caller-saved (Non-preserved):** `t0`-`t6` and `a0`-`a7`. The caller is responsible for saving these if they need to retain their values across a function call [88](#page=88).
* **Callee-saved (Preserved):** `s0`-`s11` and `sp`. The callee *must* save these registers if it intends to use them and restore them before returning [88](#page=88).
The stack is used to save registers that need to be preserved by the callee. The stack grows downwards from higher to lower memory addresses, and the stack pointer (`sp`) points to the top of the stack [81](#page=81) [83](#page=83) [84](#page=84) [85](#page=85).
To save registers on the stack:
1. **Make space:** Decrement the stack pointer by the total size of the data to be stored (e.g., `addi sp, sp, -12` to save three 4-byte words) [86](#page=86).
2. **Store:** Use `sw` (store word) to save register values at offsets relative to `sp` (e.g., `sw s3, 8(sp)`) [86](#page=86).
3. **Perform computation.**
4. **Restore:** Use `lw` (load word) to retrieve saved values from the stack.
5. **Deallocate space:** Increment the stack pointer by the amount it was decremented (e.g., `addi sp, sp, 12`) [86](#page=86).
> **Tip:** When a function needs to use a callee-saved register (`s0`-`s11`), it must save its original value on the stack *before* using it and restore it *after* it has finished using it and before returning [89](#page=89).
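A sketch of this save/compute/restore pattern for a callee that uses `s3`, `s4`, and `s5` (the register choice is illustrative):

```assembly
func:   addi sp, sp, -12    # make room for three words
        sw   s3, 8(sp)      # save callee-saved registers
        sw   s4, 4(sp)
        sw   s5, 0(sp)
        # ... function body may now freely use s3, s4, s5 ...
        lw   s5, 0(sp)      # restore the saved values
        lw   s4, 4(sp)
        lw   s3, 8(sp)
        addi sp, sp, 12     # deallocate the stack space
        jr   ra             # return to the caller
```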
#### 3.5.4 Non-leaf function calls
A non-leaf function is one that calls another function. Before calling another function, a non-leaf function must save its own return address (`ra`) on the stack, in addition to any other registers it needs to preserve. This is because the subsequent `jal` instruction to call the next function will overwrite `ra` with the return address for the *current* function. After the called function returns, the non-leaf function restores `ra` and any other saved registers before returning to its own caller [91](#page=91) [92](#page=92) [93](#page=93).
> **Example:** If `f1` calls `f2`, `f1` must save `ra`, `s4`, and `s5` on the stack before calling `f2`. `f2` might also use the stack to save its own registers, like `s4`. When `f2` returns, `f1` restores its saved registers (including `ra`) and then returns to its caller [92](#page=92) [93](#page=93).
#### 3.5.5 Recursive functions
A recursive function is one that calls itself. Implementing a recursive function in assembly involves a two-pass approach:
1. **Pass 1:** Treat the recursive call as if it were a call to a different function, ignoring potential register overwrites and stack usage for now [96](#page=96) [99](#page=99).
2. **Pass 2:** Identify which registers are overwritten by the recursive call and are needed *after* the call returns. These registers, along with the return address (`ra`), must be saved on the stack before the recursive call and restored after [100](#page=100) [99](#page=99).
For a factorial function: `factorial(n) = n * factorial(n-1)`. The base case is `factorial = 1` [1](#page=1) [97](#page=97) [98](#page=98).
* The argument `n` is passed in `a0`.
* If `n <= 1`, return 1 by placing 1 in `a0` and returning with `jr ra`.
* If `n > 1`, the function needs to:
1. Save `a0` (the current `n`) and `ra` on the stack [100](#page=100).
2. Decrement `a0` to `n-1` for the recursive call: `addi a0, a0, -1`.
3. Make the recursive call: `jal factorial`.
4. After the call returns, `a0` holds `factorial(n-1)`. Restore the original `n` from the stack into a temporary register (e.g., `t1`) [100](#page=100).
5. Restore `ra` from the stack.
6. Deallocate stack space.
7. Calculate the final result: `mul a0, t1, a0` (n * factorial(n-1)) [100](#page=100).
8. Return: `jr ra`.
The stack grows with each recursive call and shrinks as the function returns, creating nested stack frames for each invocation .
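A sketch of the routine following these steps (label names and the exact stack layout are illustrative; `mul` is used for the final product):

```assembly
# a0 = factorial(a0)
factorial:
        addi t0, zero, 2
        blt  a0, t0, base     # if n <= 1, take the base case
        addi sp, sp, -8       # save n and the return address
        sw   a0, 4(sp)
        sw   ra, 0(sp)
        addi a0, a0, -1       # argument for the recursive call: n - 1
        jal  factorial        # a0 = factorial(n - 1)
        lw   t1, 4(sp)        # restore the original n
        lw   ra, 0(sp)        # restore the return address
        addi sp, sp, 8        # deallocate the stack frame
        mul  a0, t1, a0       # a0 = n * factorial(n - 1)
        jr   ra
base:   addi a0, zero, 1      # factorial(0) = factorial(1) = 1
        jr   ra
```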
---
# RISC-V machine language and instruction formats
This topic details the binary representation of RISC-V instructions, including their various formats, fields, and how assembly instructions translate to machine code.
### 4.1 Introduction to machine language
Computers fundamentally understand only binary sequences (1s and 0s). Machine language is the direct binary representation of instructions that a processor can execute. RISC-V instructions are typically 32-bit in length, a design choice that supports regularity in both data and instruction sizes. To manage this complexity, RISC-V defines four primary instruction formats: R-Type, I-Type, S/B-Type, and U/J-Type .
### 4.2 Instruction formats
The RISC-V instruction set architecture (ISA) employs several distinct instruction formats, each designed to efficiently encode different types of operations and operand addressing. These formats share common fields like the opcode, which dictates the basic operation, but differ in the arrangement and interpretation of other fields, particularly for immediate values and register operands .
#### 4.2.1 R-Type format
The R-Type (Register-Type) format is used for instructions that operate on three register operands: two source registers (`rs1`, `rs2`) and one destination register (`rd`). It also includes `funct7` and `funct3` fields, which, in conjunction with the `opcode`, specify the exact operation to be performed .
**Structure:**
The R-Type format is structured as follows:
| `funct7` (7 bits) | `rs2` (5 bits) | `rs1` (5 bits) | `funct3` (3 bits) | `rd` (5 bits) | `opcode` (7 bits) |
| :---------------- | :------------ | :------------ | :---------------- | :----------- | :---------------- |
| 31:25 | 24:20 | 19:15 | 14:12 | 11:7 | 6:0 |
**Example:** The `add x18, x19, x20` assembly instruction might translate to a specific R-Type binary representation where `rs1` points to register `x19`, `rs2` to `x20`, `rd` to `x18`, and the `opcode`, `funct3`, and `funct7` fields specify the addition operation .
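Worked out field by field (the register numbers follow from the instruction; the resulting hexadecimal word is computed here rather than quoted from the slides):

```assembly
# add x18, x19, x20
# funct7   rs2     rs1     funct3  rd      opcode
# 0000000  10100   10011   000     10010   0110011
# = 0000 0001 0100 1001 1000 1001 0011 0011 = 0x01498933
```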
#### 4.2.2 I-Type format
The I-Type (Immediate-Type) format is used for instructions that involve two register operands (`rs1`, `rd`) and a 12-bit signed immediate value. This format is commonly used for operations like adding an immediate value to a register (`addi`) or for load instructions (`lw`) .
**Structure:**
The I-Type format is structured as follows:
| `imm[11:0]` (12 bits) | `rs1` (5 bits) | `funct3` (3 bits) | `rd` (5 bits) | `opcode` (7 bits) |
| :-------------------- | :------------ | :---------------- | :----------- | :---------------- |
| 31:20 | 19:15 | 14:12 | 11:7 | 6:0 |
**Example:** The `addi x8, x9, 12` instruction uses the I-Type format. The immediate value `12` is encoded in the `imm[11:0]` field, `rs1` points to `x9`, and `rd` points to `x8`. Load instructions like `lw t2, -6(s3)` also use this format, where the immediate is an offset from the base address in `rs1` .
#### 4.2.3 S/B-Type format
The S/B-Type formats are used for Store and Branch instructions, respectively. They share a common structure but differ in their immediate encoding and intended use .
##### 4.2.3.1 S-Type format (Store-Type)
The S-Type format is used for store instructions that write data from a register (`rs2`) to a memory location. The memory address is calculated using a base register (`rs1`) and a 12-bit signed immediate offset .
**Structure:**
The S-Type format is structured as follows:
| `imm[11:5]` (7 bits) | `rs2` (5 bits) | `rs1` (5 bits) | `funct3` (3 bits) | `imm[4:0]` (5 bits) | `opcode` (7 bits) |
| :------------------- | :------------- | :------------- | :---------------- | :------------------ | :---------------- |
| 31:25 | 24:20 | 19:15 | 14:12 | 11:7 | 6:0 |
**Example:** The `sw x7, -6(x19)` instruction uses the S-Type format. `rs1` (x19) holds the base address, `rs2` (x7) holds the value to be stored, and the immediate `-6` is split into two parts (`imm[11:5]` and `imm[4:0]`) to form the full 12-bit offset .
##### 4.2.3.2 B-Type format (Branch-Type)
The B-Type format is used for conditional branch instructions. It typically involves two source registers (`rs1`, `rs2`) and a 13-bit signed immediate value that represents an offset from the current program counter (PC) to the target branch address. The immediate encoding is split across fields to accommodate the requirement for two source registers .
**Structure:**
The B-Type format is structured as follows:
| `imm[12]` (1 bit) | `imm[10:5]` (6 bits) | `rs2` (5 bits) | `rs1` (5 bits) | `funct3` (3 bits) | `imm[4:1]` (4 bits) | `imm[11]` (1 bit) | `opcode` (7 bits) |
| :---------------- | :------------------- | :------------- | :------------- | :---------------- | :------------------ | :---------------- | :---------------- |
| 31 | 30:25 | 24:20 | 19:15 | 14:12 | 11:8 | 7 | 6:0 |
**Example:** For a `beq x8, x30, L1` instruction, the `imm` field encodes the offset to the label `L1`. The immediate value is calculated relative to the PC of the branch instruction itself .
#### 4.2.4 U/J-Type format
The U/J-Type formats cater to instructions that manipulate larger immediate values for setting register contents or for unconditional jumps .
##### 4.2.4.1 U-Type format (Upper-Immediate-Type)
The U-Type format is primarily used for the `lui` (load upper immediate) instruction. It loads the upper 20 bits of a 32-bit immediate into a destination register (`rd`), with the lower 12 bits implicitly set to zero .
**Structure:**
The U-Type format is structured as follows:
| `imm[31:12]` (20 bits) | `rd` (5 bits) | `opcode` (7 bits) |
| :--------------------- | :----------- | :---------------- |
| 31:12 | 11:7 | 6:0 |
**Example:** The `lui x21, 0x8CDEF` instruction loads `0x8CDEF` into the upper 20 bits of register `x21`, producing `0x8CDEF000` (the lower 12 bits are zero). The immediate `0x8CDEF` occupies the most significant 20 bits of the instruction.
##### 4.2.4.2 J-Type format (Jump-Type)
The J-Type format is used for unconditional jump instructions, specifically `jal` (jump and link). It encodes a 20-bit immediate value that forms a 21-bit offset relative to the PC. It also specifies a destination register (`rd`) for storing the return address .
**Structure:**
The J-Type format is structured as follows:
| `imm[20]` (1 bit) | `imm[10:1]` (10 bits) | `imm[11]` (1 bit) | `imm[19:12]` (8 bits) | `rd` (5 bits) | `opcode` (7 bits) |
| :---------------- | :-------------------- | :---------------- | :-------------------- | :------------ | :---------------- |
| 31 | 30:21 | 20 | 19:12 | 11:7 | 6:0 |
**Example:** The `jal ra, func1` instruction uses the J-Type format. The immediate value, representing the offset to `func1`, is encoded within the instruction, and the `ra` register (return address) stores the address of the instruction following `jal` .
### 4.3 Instruction fields summary
RISC-V instructions are composed of several key fields that are interpreted differently based on the instruction format. Understanding these fields is crucial for decoding machine code and understanding instruction behavior .
* **`opcode` (7 bits):** This field is present in all instruction formats and indicates the fundamental operation type .
* **`funct3` (3 bits):** Used in R, I, S, and B types to further differentiate operations within the same `opcode` .
* **`funct7` (7 bits):** Used in R-Type instructions to specify the exact operation, often distinguishing between variations like `add` and `sub` .
* **`rd` (5 bits):** The destination register for the result of the operation .
* **`rs1` (5 bits):** The first source register .
* **`rs2` (5 bits):** The second source register (used in R-Type and S-Type) .
* **`imm` (immediate):** A constant value encoded within the instruction. The size and encoding of the immediate vary by instruction type: 12 bits for I/S, 13 bits for B (with an implicit 0 as the least significant bit), 20 bits for U, and a 21-bit offset for J (again with an implicit 0 as the least significant bit).
The combination of `opcode`, `funct3`, and `funct7` uniquely identifies the instruction .
> **Tip:** The RISC-V design prioritizes simplicity and regularity, which is why the number of instruction formats is kept small, contributing to faster processor design and execution .
### 4.4 Immediate encodings
The encoding of immediate values is a critical aspect of RISC-V instruction formats, designed to efficiently embed constants within the 32-bit instruction word. The immediate bits often occupy consistent positions across different instruction formats to simplify hardware implementation .
The sign bit of a signed immediate is typically located in the most significant bit (MSB) of the instruction. Different instruction types utilize different subsets of the 32 bits for their immediate values :
* **I-Type:** Uses a 12-bit immediate .
* **S-Type:** Uses a 12-bit immediate, split into two parts .
* **B-Type:** Uses a 13-bit immediate for branch offsets .
* **U-Type:** Uses the upper 20 bits of a 32-bit immediate .
* **J-Type:** Uses a 20-bit immediate, which forms a 21-bit offset .
#### 4.4.1 Composition of 32-bit immediates
The immediate values for different instruction types are assembled from various bit ranges within the 32-bit instruction word. This strategic placement simplifies the decoding and sign-extension logic within the processor .
### 4.5 Interpreting machine code
Interpreting machine code involves converting the binary instruction back into its assembly representation and understanding its operation. This process begins by examining the `opcode` field, which dictates how the remaining bits should be parsed according to the respective instruction format .
**Steps:**
1. **Convert Hex to Binary:** Represent the machine code (often given in hexadecimal) in its full 32-bit binary form .
2. **Identify Opcode:** Extract the `opcode` bits (bits 6:0) to determine the instruction type.
3. **Parse Fields:** Based on the `opcode`, identify and extract the other fields (`funct3`, `funct7`, `rs1`, `rs2`, `rd`, `imm`) according to the instruction format .
4. **Determine Operation:** Use the `opcode`, `funct3`, and `funct7` fields to identify the specific instruction and its operation .
5. **Assemble Operands:** Reconstruct the assembly instruction using the register names and the decoded immediate values.
**Example:** Given `0x41FE83B3`, converting to binary and examining the fields reveals it's an R-Type instruction with `opcode=51`, `funct3=0`, and `funct7=0100000`. This combination decodes to the `sub` operation. Similarly, `0xFDA48393` is an I-Type instruction (`opcode=19`, `funct3=0`) corresponding to `addi` .
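Filling in the remaining fields of the first word (the register operands below are decoded here from the bit fields; only the opcode/funct values are listed explicitly in the summary):

```assembly
# 0x41FE83B3 = 0100000 11111 11101 000 00111 0110011
#              funct7  rs2   rs1   f3  rd    opcode
# rs2 = 31 (t6), rs1 = 29 (t4), rd = 7 (t2)  =>  sub t2, t4, t6
```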
### 4.6 Addressing modes
Addressing modes define how the operands of an instruction are accessed in memory or registers. RISC-V supports several fundamental addressing modes :
#### 4.6.1 Register only
In this mode, both source and destination operands are found directly in CPU registers .
* **Example:** `add s0, t2, t3` .
#### 4.6.2 Immediate
The operand is a constant value directly encoded within the instruction itself. This is common for arithmetic and logical operations that involve a literal value .
* **Example:** `addi s4, t5, -73` .
#### 4.6.3 Base addressing
This mode is primarily used by load and store instructions. The effective memory address is calculated by adding a base register's content to a signed immediate offset .
* **Example:** `lw s4, 72(zero)` computes the address as `0 + 72`. `sw t2, -25(t1)` computes the address as `content(t1) - 25` .
#### 4.6.4 PC-relative addressing
This mode is used for branches and jump-and-link instructions. The target address is calculated relative to the current Program Counter (PC). The immediate value in the instruction encodes an offset from the PC of the branch/jump instruction .
* **Example:** `bne s8, s9, L1`. The label `L1`'s address is determined by an offset encoded in the `imm` field of the `bne` instruction, relative to the PC of the `bne` instruction itself .
---
# Program execution, compilation, and memory organization
Program execution, compilation, and memory organization details how software is transformed into machine instructions, managed in memory, and executed by the processor.
## 5. Program execution, compilation, and memory organization
The stored program concept is fundamental to modern computing, allowing processors to execute a sequence of instructions stored in memory. This eliminates the need for rewiring when switching applications; simply loading a new program into memory is sufficient. Program execution involves the processor fetching instructions sequentially from memory and performing the specified operations, with the Program Counter (PC) tracking the current instruction's address .
### 5.1 Compiling, assembling, and loading programs
The process of transforming high-level code into an executable program involves several stages:
* **Compiler**: Translates high-level language code (e.g., C, Java) into assembly code. Grace Hopper is credited with developing the first compiler .
* **Assembler**: Converts assembly code into machine code (object files) .
* **Linker**: Combines multiple object files and library files to create a single executable file .
* **Loader**: Loads the executable file into memory, preparing it for execution .
### 5.2 Memory organization
Memory stores both program instructions (also known as the "text" segment) and data. Data can be global/static, allocated before program execution, or dynamic, allocated during runtime. In RISC-V, memory is typically addressed from `0x00000000` to `0xFFFFFFFF`, supporting up to 4 gigabytes .
#### 5.2.1 RISC-V memory map
A typical RISC-V memory map divides the address space into several segments:
* **Text segment**: Contains program instructions .
* **Global Data**: Stores global and static variables .
* **Heap**: Used for dynamic memory allocation .
* **Stack**: Used for function calls, local variables, and return addresses. The stack pointer (`sp`) typically grows downwards from a high address .
* **Operating System & I/O**: Reserved for the operating system and input/output devices .
* **Exception Handlers**: Region for code that handles exceptions and interrupts .
#### 5.2.2 Endianness
Endianness refers to the order in which bytes are stored within a multi-byte word in memory .
* **Big-Endian**: The most significant byte (MSB) is stored at the lowest memory address .
* **Little-Endian**: The least significant byte (LSB) is stored at the lowest memory address .
> **Tip:** While the choice of endianness does not inherently matter for a single system, it becomes crucial when systems need to share data, as mismatches can lead to incorrect data interpretation .
>
> **Example:** If register `t0` holds `0x23456789` and this value is stored to memory, a big-endian system places the most significant byte `0x23` at the lowest address, while a little-endian system places the least significant byte `0x89` at the lowest address.
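To make the byte order concrete, the following small Python snippet (an illustration, not RISC-V code) serializes the same 32-bit value both ways using the standard `struct` module:

```python
import struct

value = 0x23456789

big    = struct.pack(">I", value)   # big-endian: most significant byte first
little = struct.pack("<I", value)   # little-endian: least significant byte first

# The byte at the lowest address comes first in each bytes object.
print(big.hex())     # 23456789 -> 0x23 at the lowest address
print(little.hex())  # 89674523 -> 0x89 at the lowest address
```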
### 5.3 Signed and unsigned instructions
RISC-V distinguishes between signed and unsigned operations for certain instructions, particularly those involving multiplication, division, branches, and comparisons. This distinction is important for correctly interpreting and manipulating data .
#### 5.3.1 Multiplication
* **Signed multiplication**: `mulh` returns the upper 32 bits of a signed × signed product.
* **Unsigned and mixed multiplication**: `mulhu` (both operands unsigned) and `mulhsu` (first operand signed, second unsigned). The lower 32 bits of the product are identical for signed and unsigned multiplication, so the single `mul` instruction serves both cases when only the lower bits are needed.
#### 5.3.2 Division and remainder
* **Signed division and remainder**: `div`, `rem` instructions .
* **Unsigned division and remainder**: `divu`, `remu` instructions .
#### 5.3.3 Branches
* **Signed comparison branches**: `blt` (branch if less than), `bge` (branch if greater than or equal to) .
* **Unsigned comparison branches**: `bltu` (branch if less than unsigned), `bgeu` (branch if greater than or equal to unsigned) .
#### 5.3.4 Set less than
* **Signed set less than**: `slt`, `slti` (immediate) instructions. These set a destination register to 1 if the first operand is less than the second, and 0 otherwise .
* **Unsigned set less than**: `sltu`, `sltiu` (immediate) instructions. RISC-V always sign-extends immediate values, even for unsigned operations .
#### 5.3.5 Loads
Instructions for loading data from memory into registers can also be signed or unsigned.
* **Signed loads**: `lh` (load halfword), `lb` (load byte). These instructions sign-extend the loaded value to create a 32-bit value for the register .
* **Unsigned loads**: `lhu` (load halfword unsigned), `lbu` (load byte unsigned). These instructions zero-extend the loaded value to create a 32-bit value .
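A quick way to see the difference is to widen one loaded byte both ways; the helper names `lb` and `lbu` below are ad hoc Python stand-ins for the instructions, not real APIs:

```python
def lb(byte):   # signed load byte: sign-extend bit 7 into a 32-bit value
    return byte - 0x100 if byte & 0x80 else byte

def lbu(byte):  # unsigned load byte: zero-extend
    return byte

loaded = 0xF0                      # byte read from memory
print(lb(loaded))                  # -16  (bit pattern 0xFFFFFFF0)
print(lbu(loaded))                 # 240  (bit pattern 0x000000F0)
```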
#### 5.3.6 Detecting overflow
RISC-V has no arithmetic instructions that flag or trap on overflow; overflow must instead be detected in software using existing instructions.
* **Unsigned overflow detection**: Can be detected by comparing the result of an addition with one of the operands using `bltu` .
* **Signed overflow detection**: Requires a more complex sequence involving comparing signs of operands and results using `slti`, `slt`, and `bne` .
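As a rough model of these checks, the sketch below wraps addition to 32 bits and applies the comparison-based tests described above; it is a Python illustration of the idea, not the exact instruction sequence:

```python
MASK = 0xFFFFFFFF

def add32(a, b):
    """32-bit wrapping addition, as the hardware add instruction behaves."""
    return (a + b) & MASK

def unsigned_overflow(a, b):
    # Mirrors the bltu check: if the wrapped sum is smaller than an operand,
    # the unsigned addition overflowed.
    return add32(a, b) < a

def signed_overflow(a, b):
    # Mirrors the slt/bne approach: overflow occurs when both operands have the
    # same sign but the result's sign differs.
    s = add32(a, b)
    sign = lambda x: (x >> 31) & 1
    return sign(a) == sign(b) and sign(s) != sign(a)

print(unsigned_overflow(0xFFFFFFFF, 1))   # True
print(signed_overflow(0x7FFFFFFF, 1))     # True (positive + positive -> negative)
```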
### 5.4 Compressed instructions
Compressed instructions are 16-bit versions of common 32-bit RISC-V instructions, designed to reduce code size. Compilers and processors can mix 32-bit and 16-bit instructions, favoring compressed versions where possible. These instructions often use a `c.` prefix, e.g., `c.add` for `add`, `c.lw` for `lw`, and `c.addi` for `addi`. Some compressed instructions use a 3-bit register code for registers `x8` to `x15`, and immediates are typically smaller (6-11 bits) .
> **Example:** A loop incrementing array elements might be implemented using compressed instructions like `c.lw`, `c.addi`, and `c.sw` for efficiency. However, if an immediate value is too large to fit in the compressed format, a non-compressed instruction like `addi` might be used instead .
### 5.5 Floating-point instructions
RISC-V supports floating-point operations through extensions:
* **RVF**: Single-precision (32-bit) floating-point .
* **RVD**: Double-precision (64-bit) floating-point .
* **RVQ**: Quad-precision (128-bit) floating-point .
Floating-point registers (`f0` to `f31`) are available, and their width is determined by the highest precision extension implemented. Lower-precision values occupy the lower bits of these registers. Common floating-point instructions include arithmetic operations (`fadd`, `fsub`, `fmul`, `fdiv`, `fsqrt`, multiply-add variants like `fmadd`), data movement (`fmv`), conversions (`fcvt`), and comparisons (`feq`, `flt`, `fle`). Precision is denoted by suffixes `.s`, `.d`, or `.q` (e.g., `fadd.s` for single-precision add). Multiply-add instructions use an R4-type format due to their four register operands .
> **Example:** The C code `scores[i] = scores[i] + 10`, with `scores` declared as an array of `float`, would be translated into RISC-V assembly using floating-point instructions such as `flw` (float load word), `fadd.s` (float add single-precision), and `fsw` (float store word) .
### 5.6 Exceptions
An exception is an unscheduled function call to an exception handler, triggered by either hardware (interrupts) or software (traps). When an exception occurs, the processor records its cause, jumps to an exception handler, and can later return to the interrupted program .
#### 5.6.1 Exception causes
Common causes of exceptions include instruction address misalignment, illegal instructions, breakpoints, load/store address misalignments, and access faults. Environment calls (`ecall`) from different privilege modes are also exceptions .
#### 5.6.2 Privilege levels
RISC-V defines privilege levels to control access to memory and privileged instructions. These levels, from highest to lowest, are :
* Machine mode (M-mode): Highest privilege, for bare-metal operation .
* Hypervisor mode (H-mode): For supporting virtual machines .
* Supervisor mode (S-mode, sometimes called system mode): For operating systems .
* User mode (U-mode): Lowest privilege, for user applications .
#### 5.6.3 Exception registers (CSRs)
Each privilege level has control and status registers (CSRs) to manage exceptions. For M-mode, these include :
* `mtvec`: Address of the exception handler .
* `mcause`: Records the cause of the exception .
* `mepc` (Exception PC): Stores the PC of the instruction that caused the exception .
* `mscratch`: A scratch register for temporary use by the exception handler .
#### 5.6.4 Exception-related instructions
Privileged instructions are used to access CSRs:
* `csrr` (CSR read): Reads a CSR into a general-purpose register .
* `csrw` (CSR write): Writes a general-purpose register's value to a CSR .
* `csrrw` (CSR read/write): Atomically reads and writes a CSR .
* `mret` (Machine mode return): Returns from an exception handler to the address stored in `mepc` .
#### 5.6.5 Exception handler summary
When an exception occurs, the processor transfers control to the handler at the `mtvec` address. The handler typically saves its context (registers), checks `mcause` to determine the error, handles the exception, and then, using `mret`, returns to the program at the instruction pointed to by `mepc` (optionally incremented by 4) .
> **Example:** An exception handler might check if `mcause` indicates an illegal instruction (value 2) or a load address misaligned (value 4). If it's an illegal instruction, it can increment `mepc` to skip the faulty instruction before returning .
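The handler's decision logic from this example can be sketched as follows; the cause codes (2 and 4) come from the example above, while the function name and the retry behaviour for the misaligned load are illustrative assumptions, and register save/restore is omitted:

```python
ILLEGAL_INSTRUCTION  = 2
LOAD_ADDR_MISALIGNED = 4

def handle_exception(mcause, mepc):
    """Return the address execution should resume at (what mret would jump to)."""
    if mcause == ILLEGAL_INSTRUCTION:
        # Skip the faulting instruction by advancing past it.
        return mepc + 4
    if mcause == LOAD_ADDR_MISALIGNED:
        # A real handler might fix up the access; here we simply retry it.
        return mepc
    raise RuntimeError(f"unhandled exception cause {mcause}")

print(hex(handle_exception(ILLEGAL_INSTRUCTION, 0x1000)))   # 0x1004
```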
---
## Common mistakes to avoid
- Review all topics thoroughly before exams
- Pay attention to formulas and key definitions
- Practice with examples provided in each section
- Don't memorize without understanding the underlying concepts
Glossary
| Term | Definition |
|------|------------|
| Architecture | The programmer's visible view of a computer, defined by its instructions, operand locations, and memory organization. |
| Microarchitecture | The specific hardware implementation of a computer architecture, detailing how the architecture is realized in circuits. |
| Assembly Language | A low-level programming language that uses mnemonics to represent machine language instructions, making it more human-readable than binary code. |
| Machine Language | The low-level binary code (1s and 0s) that a computer's processor can directly execute. |
| RISC-V | An open-source instruction set architecture (ISA) developed at UC Berkeley, known for its simplicity and modularity. |
| Instruction | A command that a computer processor can execute to perform a specific operation. |
| Operand | A value or location that an instruction operates on. |
| Register | A small, fast storage location within the CPU used to hold data and instructions that are actively being processed. |
| Memory | A hardware component that stores data and instructions for the computer; it is generally slower than registers. |
| Immediate | A constant value that is directly encoded within an instruction itself, rather than being fetched from a register or memory. |
| Mnemonic | A symbolic name or abbreviation used in assembly language to represent a machine code instruction (e.g., `add` for addition). |
| Load Word (lw) | A RISC-V instruction used to read a 32-bit word from memory into a destination register. |
| Store Word (sw) | A RISC-V instruction used to write the value from a source register into a memory location. |
| Byte-Addressable Memory | A memory system where each individual byte has a unique address. |
| Word-Addressable Memory | A memory system where memory locations are accessed in fixed-size words (e.g., 32 bits), and each word has a unique address. |
| Base Addressing | An addressing mode where the effective memory address is calculated by adding an offset to the value in a base register. |
| PC-Relative Addressing | An addressing mode used for branches and jumps where the target address is calculated relative to the current Program Counter (PC). |
| Branch Instruction | An instruction that alters the flow of program execution based on a condition. |
| Jump Instruction | An instruction that unconditionally transfers program control to a different location in the code. |
| Stack | A region of memory used for temporary storage, particularly for function calls (storing return addresses, local variables, and parameters), operating in a Last-In, First-Out (LIFO) manner. |
| Stack Pointer (sp) | A special register that points to the current top of the stack in memory. |
| Function Call Convention | A set of rules that define how functions pass arguments, return values, and manage registers to ensure interoperability between different functions. |
| Caller | In a function call, the function that initiates the call to another function. |
| Callee | In a function call, the function that is being called. |
| Return Address (ra) | A register that stores the address to which program execution should return after a function call completes. |
| R-Type Instruction Format | A RISC-V instruction format used for operations involving three register operands, typically arithmetic and logical operations. It includes fields for opcode, function codes (funct7, funct3), and three registers (rs1, rs2, rd). |
| I-Type Instruction Format | A RISC-V instruction format used for instructions that involve a register and a 12-bit immediate value, such as `addi` and `lw`. It includes fields for opcode, function code (funct3), a source register (rs1), a destination register (rd), and the immediate value. |
| S-Type Instruction Format | A RISC-V instruction format used for store operations. It includes fields for opcode, function code (funct3), two source registers (rs1 for base address, rs2 for value), and an immediate offset. |
| B-Type Instruction Format | A RISC-V instruction format used for conditional branch instructions. It includes fields for opcode, function code (funct3), two source registers (rs1, rs2) for comparison, and a 12-bit immediate offset for the branch target. |
| U-Type Instruction Format | A RISC-V instruction format used for instructions that load a 20-bit immediate value into the upper bits of a register, such as `lui`. It includes fields for opcode, destination register (rd), and the upper immediate value. |
| J-Type Instruction Format | A RISC-V instruction format used for unconditional jump-and-link instructions (jal). It includes fields for opcode, destination register (rd, often `ra`), and a 20-bit immediate offset for the jump target. |
| Pseudoinstruction | An instruction that is not directly supported by the RISC-V hardware but is translated into one or more actual RISC-V instructions by the assembler for programmer convenience. |
| Endianness | The byte order in which multi-byte data is stored in memory. Little-endian stores the least significant byte at the lowest address, while big-endian stores the most significant byte at the lowest address. |
| Exception | An event that disrupts the normal sequential execution of instructions, such as an illegal instruction, division by zero, or an interrupt. It typically causes the processor to transfer control to a dedicated exception handler. |
| Privilege Level | A mode of operation in a processor that determines the set of instructions and memory regions a program can access. RISC-V has modes like Machine, System, and User. |
| Control and Status Register (CSR) | Special registers within the processor that control its operation or hold status information, often used in managing exceptions and privilege levels. |
| Compressed Instructions | Shorter (16-bit) versions of common RISC-V instructions, designed to reduce code size and improve instruction cache efficiency. |
| Floating-Point Extension (RVF, RVD, RVQ) | Optional sets of instructions and registers in RISC-V designed to perform arithmetic operations on floating-point numbers with single (32-bit), double (64-bit), or quad (128-bit) precision. |
Cover
DDCArv_Ch7.pdf
Summary
# Introduction to microarchitecture
Microarchitecture defines the hardware implementation of a computer architecture, detailing the processor's functional components and control mechanisms.
## 1. Introduction to microarchitecture
Microarchitecture describes the hardware implementation of a given architecture. It focuses on the internal organization and design of a processor, specifically its datapath and control units [3](#page=3).
### 1.1 Processor components
A processor, as defined by its microarchitecture, consists of two primary parts:
* **Datapath:** This comprises the functional blocks responsible for executing operations, such as arithmetic logic units (ALUs), registers, and memory access units [3](#page=3).
* **Control:** This unit generates the necessary control signals to orchestrate the operations of the datapath, ensuring instructions are executed in the correct sequence and with the appropriate data flow [3](#page=3).
### 1.2 Implementation strategies
A single architecture can be implemented using various microarchitectural strategies, each with different performance characteristics. The primary strategies include:
* **Single-cycle processor:** In this approach, each instruction is designed to complete its execution within a single clock cycle. While conceptually simple, this often leads to longer clock cycle times as the cycle must be long enough to accommodate the slowest instruction [4](#page=4).
* **Multicycle processor:** This strategy breaks down the execution of each instruction into a series of shorter, sequential steps or cycles. This allows for a shorter clock cycle time compared to a single-cycle processor, as each step is simpler and takes less time. However, instructions still execute sequentially, one after another [4](#page=4).
* **Pipelined processor:** This is an advanced implementation that further breaks down instructions into a series of steps, similar to the multicycle approach. The key difference is that multiple instructions can be in different stages of execution simultaneously, overlapping their execution. This significantly increases instruction throughput, even though individual instruction latency might not decrease [4](#page=4).
> **Tip:** Understanding these different implementation strategies is crucial for appreciating the trade-offs between performance, complexity, and hardware cost in processor design.
---
# Single-cycle processor implementation
This section details the design of a single-cycle RISC-V processor, explaining its datapath and control logic to execute various instruction types, and analyzes its performance limitations [9](#page=9).
### 2.1 Overview of the single-cycle processor
The single-cycle processor aims to execute an instruction in a single clock cycle. This requires designing a datapath that can perform all necessary operations for any instruction within that cycle, and a control unit that generates the appropriate signals for each instruction. The datapath is composed of various hardware components such as the Program Counter (PC), Instruction Memory, Register File, ALU, and Data Memory [10](#page=10).
### 2.2 Example program execution
To understand the datapath's operation, an example program with `lw`, `sw`, `or`, and `beq` instructions is used [11](#page=11).
* **`lw x6, -4(x9)` (I-Type)**: This instruction loads a word from memory into register `x6`. The memory address is calculated by adding the immediate value `-4` to the content of register `x9` [11](#page=11) [12](#page=12).
* **Step 1: Fetch instruction:** The instruction at address `0x1000` is fetched from the Instruction Memory [13](#page=13).
* **Step 2: Read source operand (rs1):** The value from register `x9` (specified by `rs1` field) is read from the Register File [14](#page=14).
* **Step 3: Extend the immediate:** The 12-bit immediate value from the instruction is sign-extended to 32 bits. For `lw`, the immediate is `0xFFC` [15](#page=15).
* **Step 4: Compute the memory address:** The value of `rs1` is added to the sign-extended immediate using the ALU. The ALU operation is `add` (control signal `ALUControl = 000`). The result is `0x00002000` [16](#page=16).
* **Step 5: Read data from memory:** The computed address `0x00002000` is used to read data from the Data Memory. The data read is `0x10` (assuming this is the content at that address). This data will be written back to the Register File [17](#page=17).
* **Step 6: Determine the address of the next instruction:** The PC is incremented by 4 to point to the next instruction (`0x1004`) [18](#page=18).
* **Write back to Register File:** The data read from memory (`ReadData`) is written to the destination register `rd` (`x6`). This requires the `RegWrite` control signal to be asserted [17](#page=17).
* **`sw x6, 8(x9)` (S-Type)**: This instruction stores a word from register `x6` into memory. The memory address is calculated by adding the immediate value `8` to the content of register `x9` [11](#page=11) [20](#page=20).
* The immediate value is formed by concatenating bits `[31:25]` and `[11:7]` of the instruction [21](#page=21).
* The `MemWrite` control signal is asserted to enable writing to the Data Memory [20](#page=20).
* The value from `rs2` (register `x6`) is used as the `WriteData` [20](#page=20).
* **`or x4, x5, x6` (R-Type)**: This instruction performs a bitwise OR operation between the contents of registers `x5` (`rs1`) and `x6` (`rs2`), and stores the result in register `x4` (`rd`) [11](#page=11) [22](#page=22).
* For R-type instructions, both source operands (`rs1` and `rs2`) are read from the Register File [22](#page=22).
* The ALU performs the `or` operation (control signal `ALUControl = 011`) [22](#page=22) [29](#page=29).
* The `ALUSrc` control signal is set to `0`, indicating that the second ALU operand comes from the Register File (`rs2`) [22](#page=22).
* The `ResultSrc` control signal is set to `0` to select the `ALUResult` as the data to be written back to the `rd` register [22](#page=22).
* **`beq x4, x4, L7` (B-Type)**: This instruction branches to the label `L7` if the values in registers `x4` and `x4` are equal [11](#page=11) [23](#page=23).
* The ALU performs a subtraction between the two source operands (`rs1` and `rs2`). If the result is zero, the `Zero` flag is set [23](#page=23) [29](#page=29).
* The target address is calculated by adding the PC (which is `PC+4`) to the sign-extended immediate value [23](#page=23).
* The `PCSrc` control signal determines whether the PC is updated with `PC+4` or the calculated `PCTarget`. If the `Zero` flag is asserted and the `Branch` control signal is active (for `beq`), `PCSrc` is `1`, selecting `PCTarget` as the next PC value [23](#page=23).
### 2.3 Immediate value extension
The single-cycle processor needs to handle different immediate formats for various instruction types (I, S, B, J). An `ImmSrc` control signal selects the appropriate immediate generation logic [21](#page=21) [24](#page=24).
* **I-Type:** Immediate is bits `[31:20]`, sign-extended [21](#page=21).
* **S-Type:** Immediate is formed by concatenating bits `[31:25]` and `[11:7]`, sign-extended [21](#page=21).
* **B-Type:** Immediate is formed from instruction bits `[31]`, `[7]`, `[30:25]`, and `[11:8]`, sign-extended, with the least significant bit fixed at `0` [24](#page=24) [31](#page=31) [7](#page=7).
* **J-Type (for `jal`):** Immediate is formed from instruction bits `[31]`, `[19:12]`, `[20]`, and `[30:21]`, sign-extended, with the least significant bit fixed at `0` [20](#page=20) [40](#page=40).
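A compact way to see how the Extend unit works is to model it in software. The sketch below is an illustration under stated assumptions: it uses the two-bit `ImmSrc` encodings from the decoder table in the next section (00 = I, 01 = S, 10 = B, 11 = J) and the bit selections listed above, and it reuses the `addi` word decoded earlier in this document as a test case.

```python
def sext(value, bits):
    """Sign-extend a `bits`-wide value to a Python int."""
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def bit(word, i):
    return (word >> i) & 1

def bits(word, hi, lo):
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def extend(instr, imm_src):
    if imm_src == 0b00:  # I-type: instr[31:20]
        return sext(bits(instr, 31, 20), 12)
    if imm_src == 0b01:  # S-type: instr[31:25] ++ instr[11:7]
        return sext((bits(instr, 31, 25) << 5) | bits(instr, 11, 7), 12)
    if imm_src == 0b10:  # B-type: instr[31], instr[7], instr[30:25], instr[11:8], 0
        imm = (bit(instr, 31) << 12) | (bit(instr, 7) << 11) \
            | (bits(instr, 30, 25) << 5) | (bits(instr, 11, 8) << 1)
        return sext(imm, 13)
    if imm_src == 0b11:  # J-type: instr[31], instr[19:12], instr[20], instr[30:21], 0
        imm = (bit(instr, 31) << 20) | (bits(instr, 19, 12) << 12) \
            | (bit(instr, 20) << 11) | (bits(instr, 30, 21) << 1)
        return sext(imm, 21)

print(extend(0xFDA48393, 0b00))   # -38, the addi immediate decoded earlier
```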
### 2.4 Control Unit
The control unit decodes the instruction's `op` (opcode) and other relevant fields (like `funct3`, `funct7`) to generate control signals for the datapath [27](#page=27).
* **Main Decoder:** Decodes the `op` field to determine the instruction type and generate initial control signals like `RegWrite`, `ImmSrc`, `ALUSrc`, `MemWrite`, `ResultSrc`, `Branch`, and `ALUOp` [28](#page=28).
* `lw` (op=3): `RegWrite=1`, `ImmSrc=00`, `ALUSrc=1`, `MemWrite=0`, `ResultSrc=1`, `Branch=0`, `ALUOp=00` [28](#page=28).
* `sw` (op=35): `RegWrite=0`, `ImmSrc=01`, `ALUSrc=1`, `MemWrite=1`, `ResultSrc=X`, `Branch=0`, `ALUOp=00` [28](#page=28).
* R-type (op=51): `RegWrite=1`, `ImmSrc=XX`, `ALUSrc=0`, `MemWrite=0`, `ResultSrc=0`, `Branch=0`, `ALUOp=10` [28](#page=28).
* `beq` (op=99): `RegWrite=0`, `ImmSrc=10`, `ALUSrc=0`, `MemWrite=0`, `ResultSrc=X`, `Branch=1`, `ALUOp=01` [28](#page=28).
* I-Type ALU instructions (e.g., `addi`, `andi`, `ori`, `slti`) have `op=19`. They are similar to R-type but use the immediate as the second source operand. `RegWrite=1`, `ImmSrc=00`, `ALUSrc=1`, `MemWrite=0`, `ResultSrc=0`, `Branch=0`, `ALUOp=10` [35](#page=35) [36](#page=36).
* `jal` (op=111): `RegWrite=1`, `ImmSrc=11`, `ALUSrc=X`, `MemWrite=0`, `ResultSrc=10`, `Branch=0`, `ALUOp=XX`, `Jump=1` [41](#page=41).
* **ALU Decoder:** Based on `ALUOp` (from the main decoder), `funct3`, and `funct7` (for R-type), this decoder generates the 3-bit `ALUControl` signal to select the specific ALU operation [31](#page=31) [32](#page=32).
* `add`: `ALUControl = 000` [29](#page=29).
* `subtract`: `ALUControl = 001` [29](#page=29).
* `and`: `ALUControl = 010` [29](#page=29).
* `or`: `ALUControl = 011` [29](#page=29).
* `slt` (set less than): `ALUControl = 101` [29](#page=29).
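The two decoders are essentially lookup tables, so they can be sketched as dictionaries. The sketch below transcribes the rows listed above; the tuple layout and helper names are choices made here, and the ALU decoder is deliberately simplified (a full decoder also examines opcode bit 5 so that bit 30 of an `addi` immediate is not mistaken for the `sub` encoding).

```python
# (RegWrite, ImmSrc, ALUSrc, MemWrite, ResultSrc, Branch, ALUOp) per opcode,
# transcribed from the main-decoder rows above ('X' marks a don't-care).
MAIN_DECODER = {
    3:   ("1", "00", "1", "0", "1",  "0", "00"),  # lw
    35:  ("0", "01", "1", "1", "X",  "0", "00"),  # sw
    51:  ("1", "XX", "0", "0", "0",  "0", "10"),  # R-type
    99:  ("0", "10", "0", "0", "X",  "1", "01"),  # beq
    19:  ("1", "00", "1", "0", "0",  "0", "10"),  # I-type ALU (addi, andi, ...)
    111: ("1", "11", "X", "0", "10", "0", "XX"),  # jal
}

def alu_control(alu_op, funct3, funct7b5):
    """ALU decoder: map ALUOp plus funct fields to the 3-bit ALUControl value."""
    if alu_op == "00":                 # lw / sw: address = base + offset
        return 0b000                   # add
    if alu_op == "01":                 # beq: compare by subtraction
        return 0b001                   # subtract
    # alu_op == "10": R-type / I-type ALU, inspect funct3 (and funct7 bit 5)
    table = {0b000: 0b001 if funct7b5 else 0b000,  # sub / add
             0b111: 0b010,                          # and
             0b110: 0b011,                          # or
             0b010: 0b101}                          # slt
    return table[funct3]

print(MAIN_DECODER[51])                  # control signals for an R-type instruction
print(bin(alu_control("10", 0b000, 1)))  # 0b1 -> subtract (sub)
```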
### 2.5 Extending the processor
The single-cycle processor can be extended to handle more instruction types [34](#page=34).
* **I-Type ALU instructions (`addi`, `andi`, `ori`, `slti`):** These instructions are similar to R-type but use the immediate value as the second source operand for the ALU. This requires asserting `ALUSrc` and selecting the appropriate immediate using `ImmSrc` [35](#page=35) [36](#page=36) [37](#page=37).
* **`jal` (Jump and Link):** This instruction is similar to `beq` in that it uses an immediate for calculating a target address and always updates the PC. However, the jump is always taken. It also stores the `PC+4` address in the destination register `rd`. This requires a new `ImmSrc` value and `ResultSrc` to select `PC+4` [11](#page=11) [38](#page=38) [39](#page=39) [41](#page=41) [42](#page=42).
### 2.6 Performance limitations
The primary performance limitation of a single-cycle processor is that the clock cycle time (`Tc`) is determined by the longest delay path, known as the critical path, which is usually associated with the `lw` instruction [43](#page=43) [45](#page=45).
* **Program Execution Time:** This is calculated as:
$$ \text{Execution Time} = (\#\,\text{instructions}) \times (\text{cycles/instruction}) \times (\text{seconds/cycle}) = \#\,\text{instructions} \times \text{CPI} \times T_C $$
where CPI is Cycles Per Instruction and $T_C$ is the clock period. In a single-cycle processor, CPI is always 1 [44](#page=44).
* **Critical Path Calculation:** The critical path for a `lw` instruction includes:
$$ T_{C\_\text{single}} = t_{\text{pcq\_PC}} + t_{\text{mem}} + \max[t_{\text{RFread}}, t_{\text{dec}} + t_{\text{ext}} + t_{\text{mux}}] + t_{\text{ALU}} + t_{\text{mem}} + t_{\text{mux}} + t_{\text{RFsetup}} $$
This can be simplified by considering the dominant delays:
$$ T_{C\_\text{single}} = t_{\text{pcq\_PC}} + 2t_{\text{mem}}+ t_{\text{RFread}} + t_{\text{ALU}} + t_{\text{mux}} + t_{\text{RFsetup}} $$
* **Example Calculation:** Using typical delay values:
* $t_{\text{pcq\_PC}} = 40 \text{ ps}$
* $t_{\text{mem}} = 200 \text{ ps}$
* $t_{\text{RFread}} = 100 \text{ ps}$
* $t_{\text{ALU}} = 120 \text{ ps}$
* $t_{\text{mux}} = 30 \text{ ps}$
* $t_{\text{RFsetup}} = 60 \text{ ps}$
$$ T_{C\_\text{single}} = (40 + 2 \times 200 + 100 + 120 + 30 + 60) \text{ ps} = 750 \text{ ps} $$
* **Implication:** If a program has 100 billion instructions, the total execution time would be:
$$ \text{Execution Time} = (100 \times 10^9) \times 1 \times (750 \times 10^{-12} \text{ s}) = 75 \text{ seconds} $$
This highlights that all instructions, regardless of their complexity (e.g., a simple R-type instruction that takes much less time), must wait for the slowest instruction (`lw`) to complete within a single clock cycle. This inefficiency motivates the development of multi-cycle or pipelined processors [48](#page=48).
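As a sanity check, plugging the stated delays into the simplified critical-path formula reproduces the numbers above; this is a small illustrative calculation, nothing more:

```python
# Delay values (in ps) from the example above.
t_pcq_PC, t_mem, t_RFread = 40, 200, 100
t_ALU, t_mux, t_RFsetup   = 120, 30, 60

Tc_single = t_pcq_PC + 2 * t_mem + t_RFread + t_ALU + t_mux + t_RFsetup
print(Tc_single)                                # 750 (ps)

instructions, cpi = 100e9, 1                    # single-cycle: CPI is always 1
print(instructions * cpi * Tc_single * 1e-12)   # 75.0 seconds
```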
---
# Multicycle processor implementation
Multicycle processor designs break down instruction execution into a series of smaller steps, allowing for hardware reuse and a faster clock cycle compared to single-cycle processors [50](#page=50).
### 3.1 Overview of multicycle processors
The multicycle processor approach aims to improve performance by dividing instruction execution into multiple clock cycles, where each cycle performs a specific micro-operation. This contrasts with single-cycle processors, where each instruction completes in a single, albeit potentially long, clock cycle [50](#page=50).
Key advantages of multicycle processors include:
* **Higher clock speed:** The cycle time is determined by the longest stage within an instruction's execution, not the entire instruction's longest possible path [51](#page=51).
* **Faster execution for simpler instructions:** Instructions that require fewer steps complete in less time [51](#page=51).
* **Hardware reuse:** Expensive functional units, such as memory, ALU, and register file, can be shared across different clock cycles for different operations, leading to a potentially simpler and more cost-effective design [51](#page=51).
However, there is a sequencing overhead that is paid for each instruction, regardless of its complexity. The design process for a multicycle processor follows similar steps to a single-cycle processor: first, design the datapath, and then design the control unit [51](#page=51).
### 3.2 Multicycle datapath components
The multicycle processor can utilize a single, unified memory for both instructions and data, which is more realistic than separate memories used in some single-cycle designs. Key state elements include the Program Counter (PC), Instruction Register (IR), Register File, and a unified Instruction/Data Memory [52](#page=52).
The execution of an instruction is broken down into stages, illustrated here with the `lw` (load word) instruction:
* **Step 1: Fetch instruction**
The PC provides the address to fetch the instruction from memory. The fetched instruction is loaded into the Instruction Register (IR). The PC is also updated to PC+4, preparing for the next instruction fetch [53](#page=53).
* **Step 2: Read source operand(s) and extend immediate**
For instructions like `lw`, the immediate value is extended. The required source register(s) (e.g., `Rs1`) are read from the Register File [54](#page=54).
* **Step 3: Compute memory address**
The memory address for `lw` is calculated by adding the base register value (`Rs1`) to the sign-extended immediate value. This calculation is performed by the ALU [55](#page=55).
* **Step 4: Read data from memory**
The calculated address is used to read data from the Instruction/Data Memory [56](#page=56).
* **Step 5: Write data back to register file**
The data read from memory is written back to the destination register (`Rd`) in the Register File [57](#page=57).
* **Step 6: Increment PC**
The PC is incremented to PC+4. This step is fundamental to all instruction types for fetching the next instruction sequentially [58](#page=58).
#### 3.2.1 Datapath for other instructions
The datapath is designed to accommodate various instruction types by controlling the flow of data and the operation of functional units:
* **`sw` (store word):** This instruction requires calculating the memory address, reading the data from a register (`Rs2`), and then writing that data to memory at the computed address. It does not write back to the register file [60](#page=60).
* **`beq` (branch if equal):** To handle branches, the processor needs to compute the branch target address (BTA) which is `PC + imm`. It also needs to compare two source registers (`Rs1` and `Rs2`) using the ALU. The `Zero` output of the ALU is crucial here; if it's high, the PC is updated to the BTA. The current PC value needs to be preserved as `OldPC` for calculating the target address if the branch is taken [61](#page=61).
* **R-Type instructions:** These instructions involve reading two source registers, performing an ALU operation (e.g., addition, subtraction), and writing the result back to a destination register. The ALU operation is determined by `funct3` and `funct7` fields [80](#page=80) [81](#page=81).
* **I-Type ALU instructions (e.g., `addi`):** These instructions read a source register, read an immediate value, perform an ALU operation, and write the result back to a destination register [90](#page=90).
* **`jal` (jump and link):** This instruction computes the jump target address (`PC + imm`), writes the return address (`PC + 4`) into the destination register (`rd`), and then updates the PC to the target address [92](#page=92) [93](#page=93).
### 3.3 Multicycle control unit
The control unit is responsible for generating the control signals that orchestrate the datapath operations across the multiple clock cycles. It consists of an Instruction Decoder, an ALU Decoder, and a Main Finite State Machine (FSM) [64](#page=64).
* **Instruction Decoder:** This component decodes the `op` field of the instruction to determine the instruction type and generate signals like `ImmSrc` to control immediate value extension [65](#page=65).
* **ALU Decoder:** This unit, similar to the single-cycle design, decodes the `op` and `funct3`/`funct7` fields to generate control signals for the ALU (`ALUControl`) [64](#page=64).
* **Main FSM:** This is the core of the control unit. It sequences through states to execute an instruction, generating signals such as `ALUSrcA`, `ALUSrcB`, `ALUOp`, `ResultSrc`, `RegWrite`, `MemWrite`, `IRWrite`, `PCWrite`, `AdrSrc`, and `Branch` [66](#page=66).
#### 3.3.1 Finite State Machine (FSM) for instruction execution
The Main FSM manages the states for executing different instruction types. Each state is associated with specific control signal values for a particular clock cycle.
* **Fetch (S0):** Reads instruction from memory, enables `IRWrite`, and sets `AdrSrc` to 0 to output PC. It also calculates PC+4 using the ALU while the ALU is not otherwise used, and sets `PCUpdate` [67](#page=67) [74](#page=74) [75](#page=75).
* **Decode (S1):** Reads source registers (`Rs1`, `Rs2`) from the register file. For branches and jumps, it also calculates the target address (`PC + imm`) using the ALU [68](#page=68) [84](#page=84) [85](#page=85).
* **Memory Address (S2):** Computes the memory address for load and store operations by adding the base register value to the sign-extended immediate [69](#page=69).
* **Memory Read (S3):** Reads data from memory using the computed address for `lw` instructions [70](#page=70) [71](#page=71).
* **Memory Write Back (S4):** Writes the data read from memory back to the register file for `lw` instructions [72](#page=72) [73](#page=73).
* **Memory Write (S5):** Writes data from `Rs2` to memory for `sw` instructions [77](#page=77) [78](#page=78).
* **R-Type Execute (S6):** Performs the ALU operation specified by the instruction on source registers [79](#page=79) [80](#page=80).
* **ALU Write Back (S7):** Writes the result of an R-type ALU operation back to the destination register [81](#page=81) [82](#page=82).
* **I-Type ALU Execute (S8):** Performs the ALU operation for I-type instructions, involving a register and an immediate value [89](#page=89) [90](#page=90).
* **JAL (S9):** Calculates `PC + 4`, writes this value to `rd`, and updates the PC to the jump target address [91](#page=91) [92](#page=92).
* **BEQ (S10):** Compares `Rs1` and `Rs2`. If they are equal (ALU `Zero` output is 1), it updates the PC to the branch target address; otherwise, it proceeds to the next sequential instruction [86](#page=86) [87](#page=87).
> **Tip:** The FSM simplifies control signal management by defaulting most signals to 0 or "don't care" if not explicitly listed for a state, reducing the complexity of the control logic [66](#page=66).
### 3.4 Performance considerations
Multicycle processors exhibit variable execution times for different instructions because they take a varying number of cycles to complete [96](#page=96).
* **Instruction cycle counts:**
* `beq`: 3 cycles [96](#page=96).
* R-type, `addi`, `sw`, `jal`: 4 cycles [96](#page=96).
* `lw`: 5 cycles [96](#page=96).
* **Cycles Per Instruction (CPI):** The overall CPI is a weighted average based on the frequency of each instruction type in a program. For example, using the SPECINT2000 benchmark [96](#page=96):
Average CPI = (0.13 * 3) + ((0.52 + 0.10) * 4) + (0.25 * 5) = 4.12 cycles/instruction [96](#page=96).
* **Critical Path:** The critical path in a multicycle processor determines the clock cycle time. It is typically bounded by the longest delay among the stages. A common formula for the multicycle clock cycle time ($T_{c\_multi}$) is:
$$T_{c\_multi} = t_{pcq} + t_{dec} + 2 \times t_{mux} + \max(t_{ALU}, t_{mem}) + t_{setup}$$
where:
* $t_{pcq}$: PC clock-to-Q delay [98](#page=98).
* $t_{dec}$: Decoder delay (control unit) [98](#page=98).
* $t_{mux}$: Multiplexer delay [98](#page=98).
* $t_{ALU}$: ALU delay [98](#page=98).
* $t_{mem}$: Memory read delay [98](#page=98).
* $t_{setup}$: Register setup time [98](#page=98).
A sample calculation based on provided delays yields $T_{c\_multi} = 375$ ps [99](#page=99).
* **Performance Example:** For a program with 100 billion instructions, a CPI of 4.12, and a clock cycle time of 375 ps, the total execution time would be:
Execution Time = (100 × $10^9$) × 4.12 × (375 × $10^{-12}$) seconds = 155 seconds. This can be slower than a single-cycle processor if the single-cycle processor's clock speed is significantly higher [100](#page=100).
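The same CPI and execution-time arithmetic can be written out directly; the instruction-mix grouping below mirrors the cycle counts listed above, and the 155-second figure in the text is the rounded result:

```python
# Instruction mix from the SPECINT2000 example and cycle counts per type.
mix    = {"beq": 0.13, "rtype_addi_sw_jal": 0.62, "lw": 0.25}
cycles = {"beq": 3,    "rtype_addi_sw_jal": 4,    "lw": 5}

cpi = sum(mix[k] * cycles[k] for k in mix)
print(round(cpi, 2))                      # 4.12 cycles/instruction

Tc_multi = 375e-12                        # clock period in seconds
print(round(100e9 * cpi * Tc_multi, 1))   # 154.5 s, i.e. about 155 seconds
```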
> **Tip:** While multicycle processors offer hardware savings and better clock speeds than a naive single-cycle design, their overall performance is dictated by the CPI and clock cycle time. Optimizing the critical path and reducing the number of cycles for frequent instructions are key to improving performance.
---
# Pipelined processor implementation and hazards
This section details the implementation of a pipelined processor and the various hazards that can arise, along with strategies for their resolution.
### 4.1 Pipelined processor implementation
Pipelining is a technique used to improve the throughput of a processor by overlapping the execution of multiple instructions. Instead of executing one instruction completely before starting the next (as in a single-cycle processor), pipelining divides the instruction execution into several distinct stages. These stages are then executed concurrently for different instructions .
#### 4.1.1 The 5-stage pipeline
The common 5-stage pipeline architecture consists of the following stages:
1. **Fetch (IF):** Fetches the instruction from memory based on the Program Counter (PC).
2. **Decode (ID):** Decodes the instruction and reads the required operands from the register file.
3. **Execute (EX):** Performs the arithmetic or logical operation using the ALU.
4. **Memory (MEM):** Accesses data memory for loads or stores.
5. **Writeback (WB):** Writes the result back to the register file.
Pipeline registers are introduced between these stages to hold the intermediate results and control signals for each instruction as it progresses through the pipeline. This allows different instructions to be in different stages simultaneously .
#### 4.1.2 Single-cycle vs. Pipelined execution
In a single-cycle processor, each instruction takes one clock cycle to complete, and the clock cycle must be long enough to accommodate the slowest instruction (the critical path). This leads to underutilization of hardware, as simpler instructions take the same amount of time as complex ones .
Pipelining, on the other hand, breaks down instruction execution into smaller stages, each taking one clock cycle. The clock cycle time is determined by the longest stage (the pipelined critical path). This allows for a higher throughput, as ideally, one instruction completes every clock cycle, even though individual instruction latency might increase slightly due to pipeline register overhead .
* **Single-Cycle Execution:** An instruction occupies all stages for the entire clock cycle.
* **Pipelined Execution:** Different instructions occupy different stages in each clock cycle. For example, in cycle 2, the first instruction might be in the Decode stage, while the second instruction is in the Fetch stage .
#### 4.1.3 Pipelined datapath
The datapath of a pipelined processor includes pipeline registers between each stage to store stage-specific information. Control signals are also propagated through the pipeline, with each instruction "dropping off" its control signals when they are no longer needed. Signals are often appended with the first letter of the stage they belong to (e.g., `PCF` for PC in Fetch, `PCD` for PC in Decode) .
To avoid read/write conflicts within a cycle, the register file is written in the first half of the clock cycle (typically on the falling edge of the clock) so that a value written back in the Writeback stage can be read in the second half of the same cycle by an instruction in the Decode stage .
#### 4.1.4 Control signals in pipelined processors
The control unit for a pipelined processor is similar to that of a single-cycle processor, but its control signals are passed through the pipeline registers. These signals, such as `RegWrite`, `MemWrite`, `ALUSrc`, `ALUControl`, and `ResultSrc`, travel with the instruction and are asserted in the appropriate stages .
### 4.2 Pipelined processor hazards
Hazards are conditions that prevent the next instruction in the instruction stream from executing during its normal clock cycle. They can cause a pipeline to deviate from its ideal behavior of completing one instruction per cycle .
#### 4.2.1 Data hazards
Data hazards occur when an instruction depends on the result of a previous instruction that has not yet completed and written its result back to the register file. There are three main types :
1. **RAW (Read After Write):** An instruction tries to read a register before a previous instruction has written to it. This is the most common type.
2. **WAR (Write After Read):** An instruction tries to write to a register that a previous instruction has already read. This is less common in simple pipelines due to the sequential nature of WB.
3. **WAW (Write After Write):** Two instructions try to write to the same register. The pipeline must ensure the writes occur in the correct program order.
**Example:** Consider an `add` instruction followed by a `sub` instruction. If the `sub` instruction needs the result of the `add` instruction, and the `add` instruction has not yet written its result to the register file by the time `sub` needs it, a data hazard occurs .
#### 4.2.2 Handling data hazards
Several techniques can be employed to handle data hazards:
* **Compile-time techniques:**
* **Inserting NOPs (No-Operation instructions):** The compiler inserts dummy instructions to create delays, allowing the required data to become available .
* **Code reordering:** The compiler rearranges independent instructions to execute before the dependent instruction, filling the pipeline stalls .
* **Run-time techniques:**
* **Data Forwarding (or Bypassing):** This is a hardware-based solution where the result of an instruction is forwarded from the output of an earlier stage (e.g., Execute or Memory) directly to the input of a later stage (e.g., Execute) before it is written back to the register file. This significantly reduces stalls .
* The hazard unit checks if the source registers (`Rs1E`, `Rs2E`) for the instruction in the Execute stage match the destination registers (`RdM`, `RdW`) of instructions in the Memory or Writeback stages .
* If a match is found and the previous instruction is writing to the register (`RegWriteM` or `RegWriteW` is asserted), the data is forwarded .
* **Forwarding logic:**
* **Case 1 (Forward from Memory):** If `Rs1E` (or `Rs2E`) matches `RdM` and `RegWriteM` is asserted, forward from the Memory stage. `ForwardAE` (or `ForwardBE`) is set to `10` .
* **Case 2 (Forward from Writeback):** If `Rs1E` (or `Rs2E`) matches `RdW` and `RegWriteW` is asserted, forward from the Writeback stage. `ForwardAE` (or `ForwardBE`) is set to `01` .
* **Case 3 (No Forwarding):** Otherwise, read from the register file. `ForwardAE` (or `ForwardBE`) is set to `00` .
* **Exclusion of zero register:** The forwarding logic also includes the checks `(Rs1E != 0)` and `(Rs2E != 0)` so that no value is forwarded for the constant-zero register `x0`, which must always read as 0 (modelled in the sketch after this list).
* **Stalling (or Pipeline Bubbles):** If forwarding is not possible (e.g., for a load instruction where the data is not available until the Memory stage and needed in the Execute stage of the immediately following instruction), the pipeline must stall .
* **Load-Use Hazard:** A common scenario is when an instruction immediately follows a `lw` instruction and uses the loaded data. The data from memory is only available at the end of the Memory stage, but it's needed in the Execute stage of the next instruction. This requires a stall .
* **Stalling Logic:** A stall is triggered when a source register in the Decode stage (`Rs1D` or `Rs2D`) matches the destination register in the Execute stage (`RdE`) while the instruction in the Execute stage is a `lw`, indicated by bit 0 of `ResultSrcE` being `1` (result comes from memory).
* The logic `lwStall = ((Rs1D == RdE) OR (Rs2D == RdE)) AND ResultSrcE0` determines whether a stall is necessary.
* When a stall occurs, the Fetch and Decode stages are frozen (`StallF = StallD = lwStall`), and the instruction in the Execute stage is flushed or invalidated (`FlushE = lwStall`) .
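The forwarding and load-use equations above translate almost literally into code. The sketch below is an illustrative software model, not hardware: the signal names follow the text (lower-case with underscores), and the example values are made up to trigger each case.

```python
def forward(rs_e, rd_m, rd_w, regwrite_m, regwrite_w):
    """Forwarding select for one ALU operand in the Execute stage."""
    if rs_e != 0 and rs_e == rd_m and regwrite_m:
        return 0b10          # forward the ALU result from the Memory stage
    if rs_e != 0 and rs_e == rd_w and regwrite_w:
        return 0b01          # forward the result from the Writeback stage
    return 0b00              # no forwarding: use the register-file value

def lw_stall(rs1_d, rs2_d, rd_e, result_src_e0):
    """Load-use hazard: stall Fetch/Decode and flush Execute for one cycle."""
    return result_src_e0 and (rs1_d == rd_e or rs2_d == rd_e)

# The instruction in Execute reads x5, which the instruction in Memory writes.
print(bin(forward(rs_e=5, rd_m=5, rd_w=3, regwrite_m=True, regwrite_w=True)))  # 0b10

# An instruction in Decode uses x6 while a lw destined for x6 sits in Execute.
stall = lw_stall(rs1_d=6, rs2_d=2, rd_e=6, result_src_e0=True)
print(stall)                 # True -> StallF = StallD = FlushE = 1
```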
#### 4.2.3 Control hazards
Control hazards arise from branches and jumps, where the sequence of instruction execution changes based on a condition. The processor fetches instructions sequentially, but a branch instruction might cause it to jump to a different address. The decision about whether to branch is often made late in the pipeline (e.g., in the Execute stage), meaning instructions speculatively fetched after the branch might need to be discarded .
**Example:** A `beq` instruction whose condition is evaluated in the Execute stage. If the branch is taken, the subsequent instructions fetched into the pipeline must be flushed .
* **Branch Misprediction Penalty:** The number of instructions that are flushed when a branch is taken is called the branch misprediction penalty. In a 5-stage pipeline, if the branch decision is made in the Execute stage, typically 2 instructions (Fetch and Decode stages) are flushed .
#### 4.2.4 Handling control hazards
Methods to mitigate control hazards include:
* **Branch Prediction:** The processor speculatively predicts whether a branch will be taken or not taken. If the prediction is correct, no stall occurs. If incorrect, the pipeline must be flushed. Sophisticated branch predictors exist (e.g., static prediction, dynamic prediction).
* **Delayed Branch:** The instruction immediately following the branch is executed regardless of the branch outcome. This instruction is placed in the "branch delay slot" and must be useful to the program .
* **Flushing:** If a branch is taken (or mispredicted), the instructions that have been fetched but not yet retired must be discarded or "flushed" from the pipeline .
* **Flushing Logic:** If a branch is taken in the Execute stage (`PCSrcE` is asserted), the instructions in the Fetch and Decode stages need to be flushed. This is achieved by clearing the Decode and Execute pipeline registers using `FlushD` and `FlushE` signals .
* The flushing logic is: `FlushD = PCSrcE` and `FlushE = lwStall OR PCSrcE`. This means that if there's a branch taken (`PCSrcE`) or a load-use stall (`lwStall`) that requires flushing the execute stage, `FlushE` becomes active. `FlushD` is active if a branch is taken .
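The two flush equations can likewise be modelled directly; this tiny sketch simply restates them with the signal names from the text:

```python
def flush_signals(pc_src_e, lw_stall):
    """Combine the branch-taken and load-use conditions into the flush controls."""
    flush_d = pc_src_e                 # a taken branch invalidates the Decode stage
    flush_e = lw_stall or pc_src_e     # a stall or taken branch invalidates Execute
    return flush_d, flush_e

print(flush_signals(pc_src_e=True,  lw_stall=False))   # (True, True)  - branch taken
print(flush_signals(pc_src_e=False, lw_stall=True))    # (False, True) - load-use stall
```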
> **Tip:** Effective hazard handling is crucial for achieving near-ideal performance from a pipelined processor. While forwarding reduces stalls significantly for data hazards, load-use hazards and control hazards often necessitate stalls or flushing.
### 4.3 Pipelined processor performance
The performance of a pipelined processor is measured by its average CPI (Cycles Per Instruction) and execution time.
#### 4.3.1 Performance Metrics
* **Average CPI:** Ideally, a pipelined processor achieves a CPI of 1. However, hazards cause stalls, increasing the average CPI.
* **Example Calculation:** For a SPECINT2000 benchmark with 25% loads, 10% stores, 13% branches, and 52% R-type instructions:
* Assume 40% of loads cause a stall (CPI = 2 for loads, 1.4 average: `1*(0.6) + 2*(0.4) = 1.4`).
* Assume 50% of branches mispredict (CPI = 3 for branches, 2 average: `1*(0.5) + 3*(0.5) = 2`).
* Average CPI = `(0.25 load CPI) + (0.1 store CPI) + (0.13 branch CPI) + (0.52 R-type CPI)`
* Average CPI = `(0.25 * 1.4) + (0.1 * 1) + (0.13 * 2) + (0.52 * 1) = 0.35 + 0.1 + 0.26 + 0.52 = 1.23` .
* **Clock Cycle Time (`Tc`):** The clock cycle time of a pipelined processor is determined by the longest delay of any single stage, plus pipeline register overhead (clock-to-Q and setup time) .
* $T_{c\_pipelined} = \max(\text{Stage Delays}) + t_{pcq} + t_{setup}$ .
* The critical path is often determined by the Execute stage, which includes ALU operations and multiplexers .
* Example calculation for the critical path in the Execute stage: $T_{c\_pipelined} = (t_{pcq} + 4t_{mux} + t_{ALU} + t_{AND-OR} + t_{setup}) = (40 + 4*30 + 120 + 20 + 50) \text{ ps} = 350 \text{ ps}$ .
* **Execution Time:**
* Execution Time = (# instructions) × CPI × $T_c$ .
* For 100 billion instructions, an average CPI of 1.23, and a $T_c$ of 350 ps:
* Execution Time = $(100 \times 10^9) \times (1.23) \times (350 \times 10^{-12} \text{ seconds}) = 43 \text{ seconds}$ .
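Written out, the stall- and misprediction-weighted CPI and the resulting execution time are as follows (illustrative arithmetic only, matching the example above):

```python
# Per-type CPI from the example above.
load_cpi   = 0.6 * 1 + 0.4 * 2        # 40% of loads stall one extra cycle -> 1.4
branch_cpi = 0.5 * 1 + 0.5 * 3        # 50% mispredicted, 2-cycle penalty  -> 2.0

cpi = 0.25 * load_cpi + 0.10 * 1 + 0.13 * branch_cpi + 0.52 * 1
print(round(cpi, 2))                        # 1.23

Tc_pipelined = 350e-12                      # clock period in seconds
print(round(100e9 * cpi * Tc_pipelined, 2)) # 43.05, i.e. about 43 seconds
```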
#### 4.3.2 Performance Comparison
Pipelining significantly improves performance compared to single-cycle or multicycle processors due to increased throughput .
* **Single-cycle:** Baseline, high latency per instruction.
* **Multicycle:** Better than single-cycle but still less efficient than pipelined.
* **Pipelined:** Offers substantial speedup by overlapping execution, despite increased complexity and potential stalls due to hazards .
---
# Advanced microarchitecture techniques
Advanced microarchitecture techniques focus on enhancing processor performance beyond basic pipelining by employing a variety of methods to execute instructions more efficiently and concurrently .
### 5.1 Deep pipelining
Deep pipelining involves increasing the number of stages in a processor's pipeline, typically to 10-20 stages. The number of stages is limited by factors such as pipeline hazards, sequencing overhead, power consumption, and cost .
### 5.2 Micro-operations
Complex instructions are decomposed into a series of simpler instructions known as micro-operations (micro-ops or µ-ops). At runtime, complex instructions are decoded into one or more micro-ops. This technique is heavily used in CISC (complex instruction set computer) architectures, such as x86. For example, a `lw` instruction with a post-increment operation can be broken down into a `lw` micro-op and an `addi` micro-op. Without micro-ops, this would necessitate a second write port on the register file .
> **Tip:** Micro-operations allow complex instructions to be handled by simpler hardware units, improving design flexibility and potentially performance.
### 5.3 Branch prediction
Branch prediction is a crucial technique to mitigate the performance penalty caused by branches in pipelined processors. It involves guessing whether a branch will be taken to avoid flushing the pipeline .
* **Static branch prediction:** This method determines the branch direction based on whether the branch is forward or backward. Backward branches (often found in loops) are typically predicted as taken, while forward branches are predicted as not taken .
* **Dynamic branch prediction:** This approach uses historical data of recent branch executions to make more accurate predictions. A branch target buffer (BTB) stores the destination and taken status of the last several hundred or thousand branches .
* **1-bit branch predictor:** Remembers the outcome of the last execution of a branch and predicts the same outcome for the next execution. This predictor mispredicts the first and last branches of a loop .
* **2-bit branch predictor:** Uses a state machine with four states (Strongly Taken, Weakly Taken, Weakly Not Taken, Strongly Not Taken) to track branch behavior. This improves prediction accuracy, only mispredicting the last branch of a loop in the provided example .
> **Example:** In a loop that executes 10 times, a 1-bit predictor will mispredict the branch condition the first time it's encountered (predicting not taken when it should be taken) and the last time (predicting taken when it should be not taken). A 2-bit predictor reduces this to only one misprediction.
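A 2-bit predictor is a small saturating counter, which is easy to simulate. The sketch below is illustrative: the initial state (strongly taken) and the outcome pattern (taken nine times, then not taken) are assumptions chosen to match the loop example, and with them a single pass through the loop produces exactly one misprediction.

```python
# 2-bit saturating counter: 0 = strongly not taken ... 3 = strongly taken.
def predict(state):
    return state >= 2                     # predict taken in the two "taken" states

def update(state, taken):
    return min(state + 1, 3) if taken else max(state - 1, 0)

# Branch outcomes for one pass of a 10-iteration loop: taken 9 times, then not taken.
outcomes = [True] * 9 + [False]

state, mispredictions = 3, 0              # start in "strongly taken"
for taken in outcomes:
    if predict(state) != taken:
        mispredictions += 1
    state = update(state, taken)

print(mispredictions)                     # 1 - only the final, not-taken branch
```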
### 5.4 Superscalar and out-of-order processors
These architectures aim to increase Instruction-Level Parallelism (ILP) by executing multiple instructions concurrently .
#### 5.4.1 Superscalar processors
Superscalar processors feature multiple copies of datapath units that can execute several instructions simultaneously. The primary challenge is managing dependencies between instructions, which can prevent multiple instructions from being issued in the same clock cycle .
> **Tip:** The ideal Instruction Per Cycle (IPC) for a superscalar processor is equal to the number of instructions it can issue per cycle. Actual IPC is often lower due to dependencies.
#### 5.4.2 Out-of-order (OOO) processors
OOO processors overcome limitations of in-order execution by looking ahead at multiple instructions and issuing them as soon as their operands are available and functional units are free, even if it's out of program order. This is done while respecting data dependencies :
* **RAW (Read After Write):** A later instruction reads a register that an earlier instruction writes to .
* **WAR (Write After Read):** A later instruction writes to a register that an earlier instruction reads .
* **WAW (Write After Write):** A later instruction writes to a register that an earlier instruction also writes to .
OOO processors utilize a scoreboard to track instructions waiting to issue, available functional units, and detected dependencies. This allows them to achieve higher actual IPC than in-order superscalar processors when dependencies cause stalls .
> **Example:** If instruction A writes to register `r1` and instruction B reads from `r1`, instruction B cannot execute until instruction A has completed its write. An OOO processor can schedule other independent instructions to execute while waiting for `r1` to be ready.
#### 5.4.3 Register renaming
Register renaming is a technique used in OOO processors to eliminate WAR and WAW hazards. It achieves this by using a larger set of physical registers than architectural registers. When an instruction needs to write to a register, it's assigned a new, free physical register. This effectively separates the read and write operations, preventing false dependencies .
> **Tip:** Register renaming is critical for enabling effective out-of-order execution by resolving naming conflicts among instructions.
### 5.5 SIMD (Single Instruction Multiple Data)
SIMD is an architecture where a single instruction operates on multiple pieces of data simultaneously. This is common in graphics processing and for accelerating short arithmetic operations, also known as packed arithmetic .
> **Example:** A single SIMD instruction could add eight 8-bit integer elements from two separate data sets, performing eight additions concurrently.
### 5.6 Multithreading and Multiprocessors
These techniques focus on parallelism at a higher level than instruction execution .
#### 5.6.1 Multithreading
Multithreading allows a processor to handle multiple threads of execution concurrently. This means that multiple copies of the architectural state exist. When one thread stalls (e.g., waiting for memory), another thread can immediately start executing, utilizing idle execution units and increasing overall throughput. Intel's implementation is known as "hyperthreading" .
* **Process:** A program running on a computer .
* **Thread:** A part of a program. A process can have multiple threads; for instance, a word processor might have threads for typing, spell checking, and printing .
* **Context Switching:** In a single-core system, when one thread stalls, its architectural state is saved, and another thread's state is loaded for execution .
> **Tip:** Multithreading improves processor utilization and throughput, especially in workloads with frequent I/O or memory access stalls, but it does not increase the ILP of a single thread.
#### 5.6.2 Multiprocessors
Multiprocessors involve multiple processors (cores) on a single chip that communicate with each other. Types include:
* **Homogeneous:** Multiple cores sharing main memory.
* **Heterogeneous:** Cores designed for different tasks, like a DSP and a CPU in a mobile device.
* **Clusters:** Each core has its own memory system.
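To see how many logical processors the running system exposes across all sockets, cores, and hardware threads, a small query like the one below can be used; note that `_SC_NPROCESSORS_ONLN` is widely supported (Linux/glibc, macOS) but not mandated by POSIX.

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long n = sysconf(_SC_NPROCESSORS_ONLN);   /* logical CPUs currently online */
    if (n < 1) { perror("sysconf"); return 1; }
    printf("online logical CPUs: %ld\n", n);
    return 0;
}
```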
---
## Common mistakes to avoid
- Skipping topics instead of reviewing them all before the exam
- Glossing over formulas and key definitions
- Ignoring the worked examples provided in each section
- Memorizing without understanding the underlying concepts
## Glossary
| Term | Definition |
|------|------------|
| Microarchitecture | The specific hardware implementation of a computer architecture, detailing how an architecture is realized in silicon. It includes the design of the processor's datapath and control logic. |
| Architecture | The abstract model of a computer, defining the instruction set, registers, memory addressing, and I/O that a programmer interacts with. It does not specify the hardware implementation details. |
| Datapath | The part of a processor that performs data processing operations, including functional units like ALUs, registers, and multiplexers, and the pathways between them. |
| Control Unit | The component of a processor that generates control signals to direct the operation of the datapath and other processor components based on the decoded instructions. |
| Single-cycle processor | A processor design where each instruction completes execution in a single clock cycle, with the cycle time determined by the longest instruction path. |
| Multicycle processor | A processor design that breaks down instruction execution into multiple clock cycles, allowing for shorter clock periods and hardware reuse. |
| Pipelined processor | A processor design that overlaps the execution of multiple instructions by dividing instruction processing into sequential stages, with each stage operating on a different instruction concurrently. |
| Clock Period (Tc) | The duration of a single clock cycle, measured in seconds. It is the inverse of the clock frequency ($Tc = 1/f_{clock}$). |
| CPI (Cycles Per Instruction) | A measure of processor performance, representing the average number of clock cycles required to execute one instruction. |
| IPC (Instructions Per Cycle) | A measure of processor performance, representing the average number of instructions that can be executed in one clock cycle. It is the inverse of CPI ($IPC = 1/CPI$). |
| RISC-V | An open-standard instruction set architecture (ISA) based on Reduced Instruction Set Computer principles, designed to be simple, modular, and extensible. |
| Architectural State Elements | The set of registers, the program counter (PC), and memory that define the current state of a processor visible to the programmer. |
| Register File (RF) | A small, fast memory unit within a processor that stores frequently used data values in general-purpose registers. |
| Program Counter (PC) | A special register that holds the memory address of the next instruction to be fetched and executed. |
| Instruction Memory | A memory unit that stores program instructions. In some designs, it is separate from data memory. |
| Data Memory | A memory unit that stores data values used and produced by the program. |
| Immediate | A constant value that is part of an instruction itself, used directly in operations without needing to be fetched from memory or the register file. |
| ALU (Arithmetic Logic Unit) | The digital circuit within a processor that performs arithmetic and logical operations on operands. |
| Control Signals | Signals generated by the control unit that dictate the operation of various components within the datapath, such as selecting operations, enabling writes, or routing data. |
| Hazard | A situation in a pipelined processor where the next instruction cannot execute in the next clock cycle due to data or control dependencies on previous instructions. |
| Data Hazard | A pipeline hazard that occurs when an instruction needs data that has not yet been produced by a preceding instruction that is still in the pipeline. |
| Control Hazard | A pipeline hazard that occurs when the processor fetches instructions that are not supposed to be executed, typically due to a branch instruction whose outcome is not yet known. |
| Forwarding (Bypassing) | A technique used in pipelined processors to resolve data hazards by sending the result of an instruction directly from its execution stage to a subsequent instruction that needs it, before it is written back to the register file. |
| Stall (Pipeline Bubble) | A mechanism in a pipelined processor to temporarily halt the pipeline and insert a "no-operation" (NOP) or "bubble" to resolve hazards, allowing dependencies to be met. |
| Flush | The process of discarding instructions that have been incorrectly fetched due to a mispredicted branch or a resolved hazard in a pipelined processor. |
| Branch Prediction | A technique used in pipelined processors to guess the outcome of a conditional branch instruction before it is fully resolved, aiming to keep the pipeline full. |
| Superscalar Processor | A processor that can execute more than one instruction per clock cycle by having multiple execution units and the ability to issue multiple instructions simultaneously. |
| Out-of-Order (OOO) Processor | A processor that can execute instructions in an order different from their original program sequence, as long as data dependencies are respected, to improve instruction-level parallelism. |
| Register Renaming | A technique used in out-of-order processors to eliminate false data dependencies (Write-After-Write and Write-After-Read) by assigning temporary physical registers to architectural registers. |
| SIMD (Single Instruction Multiple Data) | A parallel processing architecture where a single instruction operates on multiple data elements simultaneously. |
| Multithreading | A technique that allows a single processor core to execute multiple threads of a program concurrently by duplicating architectural state, improving throughput. |
| Multiprocessor | A system with multiple processing units (cores) that can operate in parallel, either sharing memory or having their own dedicated memory systems. |
| Micro-operation (µ-op) | A very basic operation that a processor can perform, often used to decompose complex instructions into simpler steps for execution, particularly in CISC architectures. |
| Branch Target Buffer (BTB) | A cache used in branch predictors to store the target addresses of recently executed branches, speeding up branch resolution. |
| SPECINT2000 | A benchmark suite used to measure the performance of integer computation in computer systems. |