All You Need to Know About SIMD – And How GPUs Employ It

Visualize an army company marching toward a battlefield. At its head is a Major, attired in full military regalia, outstanding service medals glinting on their chest. Every soldier marches in lockstep, presenting arms when instructed, halting when instructed, turning when instructed and assuming battle positions when instructed by the Major. This tightly coupled, highly synchronous company is the simplest real-world analogue of Single Instruction, Multiple Data Streams (SIMD).

But what does SIMD actually mean in computing terms? There is a little backstory: we first need to understand how GPUs work. Read on!

GPUs are specialised processors that employ many computing cores operating in parallel to achieve very high throughput with lightning-fast response times. So efficient is the latest class of these devices that Nvidia has demonstrated its A100 GPU outperforming contemporary CPUs by as much as 237 times on inference benchmarks.

Because of these benefits, the demand for GPUs for accelerating AI/ML, HPC and AR/VR workloads is rocketing straight into the stratosphere. Allied Market Research forecasts that the global GPU market, valued at USD 19.75 billion in 2019, will grow at 33.6% CAGR between 2020 and 2027 and reach USD 200.8 billion by the end of this period.

Within this massive market, leading research firm Gartner projects that AI data centers and data science workloads will increase their use of discrete GPUs at a 20% CAGR over 2020–24, propelling the discrete GPU segment to USD 11.5 billion by 2024.


Role of SIMD — Introduction to Flynn’s Taxonomy

How GPUs attain such enormous computational throughput is a question worth exploring for anyone planning to deploy them. While the full answer is complex, the building blocks at its base are the principles known as Flynn’s taxonomy, after American professor Michael J. Flynn, who codified the classification in 1972. The original taxonomy has four classes –

  1. Single Instruction Stream, Single Data Stream (SISD) – A simple sequential computer that fetches a single instruction stream from memory and runs it on a single data stream. It employs no parallel processing, except in a rudimentary form: an individual task may be fragmented into several subtasks that are operated on concurrently after checking for dependencies (if any).
  2. Single Instruction Stream, Multiple Data Streams (SIMD) – Multiple data streams, each simultaneously subjected to the same operation by the processing core handling it. The instruction is applied to the different data sets in lockstep: each core waits for the others to finish processing their data before all cores proceed together to the next batch.
  3. Multiple Instruction Streams, Single Data Stream (MISD) – Unique architecture generally used for fault testing.
  4. Multiple Instruction Streams, Multiple Data Streams (MIMD) – Autonomous processors using shared or exclusive memory space and simultaneously executing different operations on different data.
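The contrast between the first two classes can be sketched in a few lines of Python. This is a loose illustration, not hardware code: NumPy's vectorised operations stand in for SIMD lanes, while a plain loop stands in for SISD.

```python
import numpy as np

# SISD: one instruction applied to one datum at a time, sequentially.
def scale_sisd(data, factor):
    out = []
    for x in data:            # each element handled in its own turn
        out.append(x * factor)
    return out

# SIMD: the multiply is issued once and applied to every element of the
# array at once (NumPy dispatches to vectorised machine code internally).
def scale_simd(data, factor):
    return np.asarray(data) * factor

print(scale_sisd([1, 2, 3], 10))           # [10, 20, 30]
print(scale_simd([1, 2, 3], 10).tolist())  # [10, 20, 30]
```

Both produce the same result; the difference is that the SIMD version expresses the whole computation as a single operation over all the data, which is exactly the shape of workload GPUs are built for.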


Single Instruction Stream, Multiple Threads (SIMT) is commonly treated as a sub-classification of Flynn’s SIMD, though the term itself was coined by Nvidia rather than by Flynn. In SIMT, each processing core runs multiple threads and possesses its own independent ALU, register file, cache memory and data memory. As in SIMD, the instructions across all the threads execute in lockstep.

The key benefit of SIMT is that it increases computational density manifold (via multithreading) while relying on a single control logic block to manage multiple data streams concurrently. However, this comes with significantly more hardware complexity, and hence cost, since each core must be equipped with its own register file and local memory.
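The lockstep idea can be made concrete with a toy Python model (a teaching sketch, not a hardware simulator): one shared instruction stream drives many threads, each holding its own private register initialised from its own data.

```python
# Toy SIMT model: a single instruction stream executed in lockstep by
# several threads, each with a private register file.
def run_simt(program, thread_data):
    regs = [{"r0": d} for d in thread_data]   # one register set per thread
    for op, operand in program:               # single shared instruction stream
        for r in regs:                        # every thread executes in lockstep
            if op == "mul":
                r["r0"] *= operand
            elif op == "add":
                r["r0"] += operand
    return [r["r0"] for r in regs]

# Four threads, one program: each applies it to its own data.
print(run_simt([("mul", 2), ("add", 1)], [1, 2, 3, 4]))  # [3, 5, 7, 9]
```

Note that no thread advances to the next instruction until all threads have finished the current one — the marching-company behaviour from the opening analogy.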


What is Predication?

In computer architecture, predication refers to attaching a condition to individual operations so that decisions are made without branching: an operation executes if its condition (predicate) is satisfied, and is skipped if it is not.

At first glance, this appears deceptively similar to the run-of-the-mill if-else code pattern (both are condition-dependent workflows), but there are important differences.

An if-else construct operates on the data sequentially, evaluating and acting on each condition as it is encountered. Predication, by contrast, evaluates the condition for every data element up front, lets each processing core execute (or suppress) the operation on its elements according to the predicate, and finally consolidates the results from the individual cores.

Predication can thus improve performance manifold by executing multiple instruction fragments in parallel on semi-independent processing cores. It also enhances computational and energy efficiency, since almost the entire logic block is kept busy executing instructions (or fragments thereof) at any instant. If-else conditionality, in contrast, tends to leave large amounts of processor logic idle at any given moment.
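Here is a small sketch of the difference, again using NumPy as a stand-in for hardware data parallelism. The branching version tests each element with an if-else; the predicated version computes the condition as a mask for all elements and then performs a branch-free masked select.

```python
import numpy as np

# Branching version: a data-dependent if/else evaluated element by element.
def relu_branchy(xs):
    out = []
    for x in xs:
        if x > 0:
            out.append(x)
        else:
            out.append(0)
    return out

# Predicated version: evaluate the predicate for all elements at once,
# then select values under the mask — no per-element branch is taken.
def relu_predicated(xs):
    xs = np.asarray(xs)
    mask = xs > 0                  # the predicate, computed in bulk
    return np.where(mask, xs, 0)   # masked select instead of branching

print(relu_branchy([-1, 2, -3, 4]))              # [0, 2, 0, 4]
print(relu_predicated([-1, 2, -3, 4]).tolist())  # [0, 2, 0, 4]
```

The results are identical; the predicated form simply never forces the machine to choose a different instruction path per element, which is what makes it friendly to lockstep hardware.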

SIMT & Predication in GPUs –

In the simplest terms, SIMT means running the same instruction across multiple threads in parallel. Modern GPUs resolve massive AI/ML and HPC workloads in an extremely short time by advancing the SIMT architecture at multiple levels: consecutive threads proceeding in lockstep are grouped into warps, warps are grouped into thread blocks, and thread blocks are assigned to streaming multiprocessors (SMs). SMs additionally employ instruction-level parallelism (though not branch prediction or speculative execution) to help GPUs attain very high computational throughput.

A warp is considered ready for execution when its next instruction has no outstanding data dependencies. Within an SM, each warp is assigned to a warp scheduler, which can switch between resident warps with essentially no overhead, preventing stalls whenever execution is held up by a dependency. This rapid swapping between warps effectively hides instruction latency, provided there are enough active warps on the SM. As thread blocks terminate, new blocks are launched on the vacated multiprocessors. If multiple warps are ready at once, the SM uses a scheduling policy to determine which warp issues first.
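A toy model makes the latency-hiding effect visible. In the sketch below (illustrative only — real schedulers are far more sophisticated), every issued instruction keeps its warp busy for a fixed latency, and a round-robin scheduler issues from whichever warp is ready; with more resident warps, the stall cycles disappear.

```python
# Toy latency-hiding model: after issuing, a warp must wait LATENCY cycles
# before it can issue again; the scheduler picks any ready warp each cycle.
LATENCY = 4

def cycles_to_finish(num_warps, instrs_per_warp):
    remaining = [instrs_per_warp] * num_warps
    ready_at = [0] * num_warps       # cycle at which each warp may issue again
    cycle = 0
    while any(remaining):
        for w in range(num_warps):   # round-robin over resident warps
            if remaining[w] and ready_at[w] <= cycle:
                remaining[w] -= 1
                ready_at[w] = cycle + LATENCY
                break                # one issue slot per cycle
        cycle += 1
    return cycle

print(cycles_to_finish(1, 8))   # 29 cycles for 8 instructions (mostly stalls)
print(cycles_to_finish(4, 8))   # 32 cycles for 32 instructions (1 per cycle)
```

With a single warp, the issue slot sits idle three cycles out of four; with four warps, some warp is always ready, so the SM sustains one instruction per cycle — the latency has been hidden, not removed.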

Each SM also contains several on-chip memories, such as shared memory (for rapid data exchange between the threads of a block), an L1 cache (to reduce the latency of accesses to local or global memory, backed by a chip-wide L2 cache), and a constant cache (for fast broadcast of reads from constant memory).

Besides incorporating the SIMT architecture, modern GPU cores are also optimized to perform predication on the data supplied to them. This underpins GPU efficiency: the cores not only attain very high throughput, but can also independently determine whether supplied data satisfies a given condition and should be processed or skipped altogether.

SIMD usage in AI/ML

Be it traffic management, weather forecasting, structural fault testing, or any similar application, AI/ML modules involve training neural networks to recognise patterns, predict recurrences, and flag anomalies. This training rests on massive amounts of monotonous, often repetitive data, and can be accelerated through the large-scale parallel processing (SIMT) and predication that GPUs deploy. The two concepts are just as significant when the trained model is later used for inference.

For example, imagine an AI-enabled traffic management system trained to detect and identify vehicles and log their number plates. Training involves subjecting many images and videos of vehicles to the same instructions – identify the vehicle, locate the number plate, read and record it. This is SIMT in action.

Predication ensures that this system can identify and eliminate non-vehicles from its visual stream instead of processing those visuals as well for number plate recognition.
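The filtering step might be sketched like this. Everything here is hypothetical for illustration — the function names, the threshold, and the stand-in detector and reader are assumptions, not part of any real traffic system's API.

```python
# Hypothetical sketch: score each frame with a detector, and run the costly
# plate-recognition step only on frames whose "is a vehicle" confidence
# clears a threshold — the rest are skipped, as with predication.
VEHICLE_THRESHOLD = 0.5

def filter_and_read(frames, detect, read_plate):
    results = []
    for frame in frames:
        if detect(frame) >= VEHICLE_THRESHOLD:   # predicate: is this a vehicle?
            results.append(read_plate(frame))    # only survivors are processed
    return results

# Stand-in detector and plate reader, purely for illustration.
frames = ["car_A", "tree", "car_B", "pedestrian"]
detect = lambda f: 0.9 if f.startswith("car") else 0.1
read_plate = lambda f: f.upper() + "-PLATE"
print(filter_and_read(frames, detect, read_plate))  # ['CAR_A-PLATE', 'CAR_B-PLATE']
```

The expensive work is spent only where the predicate passes, which is exactly the efficiency argument made above.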

At inference time, the system applies the vast data sets and intelligence it has gathered to distinguish new vehicles amidst the traffic stream, even ones it has never seen before.

GPUs expedite the development and training of such intuitive systems for far-reaching practical applications. Ace Cloud Hosting offers the Nvidia A100, one of the world’s fastest GPUs, at highly reasonable prices. If you are still on the fence about whether GPUs suit your business goals, book an appointment with our consultants.


About Nolan Foster

With 20+ years of expertise in building cloud-native services and security solutions, Nolan Foster spearheads Public Cloud and Managed Security Services at Ace Cloud Hosting. He is well versed in the dynamic trends of cloud computing and cybersecurity.
Foster offers expert consultations for empowering cloud infrastructure with customized solutions and comprehensive managed security.

