| Category | Accelerator | Vendor | Primary use | Key specs (throughput / memory / power) | Typical deployment |
| --- | --- | --- | --- | --- | --- |
| ASICs (single‑purpose) | Google TPU v4 | Google | Cloud‑scale inference & training (TensorFlow, XLA) | ~275 TFLOP/s (BF16) per chip, 32 GB HBM2, ~190 W | Large‑scale ML data centers that run the same models repeatedly. |
| | Graphcore IPU‑700M | Graphcore | High‑performance ML training, graph‑based workloads | 700 TOP/s (int8), ~900 MB on‑chip SRAM per IPU (no HBM), 800 W per 8‑chip pod | Research clusters, hyper‑parameter search, graph‑intensive models (e.g., GNNs). |
| | Habana Gaudi 2 | Habana Labs (Intel) | Data‑center inference & training | 200 TOP/s (int8), 96 GB HBM2e, 600 W per chip | Cost‑efficient data‑center training and inference. |
| | Cerebras WSE‑2 | Cerebras Systems | Extreme‑scale training | 850,000 cores on one wafer, 40 GB on‑chip SRAM, ~23 kW (CS‑2 system) | Training the largest models (hundreds of billions of parameters). |
| | Myriad X / X2 | Intel | Edge inference (vision, audio) | 1 TFLOP/s (FP16), 1.2 GB LPDDR4, <1 W | Mobile/robotics, IoT, automotive. |
| | Apple Neural Engine (ANE) | Apple | On‑device inference (iOS, macOS) | ~11–16 TOPS (int8) on M1/M2‑class chips, ~1.5 W | Mobile ML, AR/VR, Face ID. |
| | Qualcomm Snapdragon Neural Processing Engine | Qualcomm | Mobile inference | ~300 GOPS (int8), ~1.5 W | Smartphones, wearables, automotive infotainment. |
| | NVIDIA A100 | NVIDIA (GPU with ASIC‑style tensor cores) | Mixed training & inference | 312 TFLOP/s (FP16 Tensor Core), 40 GB HBM2e, 400 W | Cloud training, AI inference, HPC workloads. |
| GPUs (massively parallel) | NVIDIA H100 | NVIDIA | Training & inference, HPC, AI workloads | ~990 TFLOP/s (FP16 Tensor Core, dense), 80 GB HBM3, 700 W | AI research, generative models, large‑scale inference. |
| | AMD MI300X | AMD | Training & inference | ~1,300 TFLOP/s (FP16), 192 GB HBM3, 750 W | HPC clusters, AI training, scientific simulation. |
| | NVIDIA RTX 4090 | NVIDIA | High‑end gaming & inference | ~83 TFLOP/s (FP16), 24 GB GDDR6X, 450 W | Edge inference, content creation, small‑scale training. |
| | Apple M1/M2 Pro/Max | Apple | Edge inference + ML compute (macOS) | ~11–16 TOPS (int8) via the Neural Engine, <30 W package power | Desktop ML, video analytics, AR. |
| FPGA‑based | Xilinx Alveo U55C | Xilinx (AMD) | Customizable inference pipelines | 1–3 TOP/s (int8), 16 GB HBM2, 150 W | Low‑latency inference in telecom, finance. |
| | Intel Stratix 10 | Intel | Custom ML accelerators | 10–20 TOP/s (int8) per card, ~100–225 W per card | Research, ASIC prototyping, edge. |
| Hybrid / “Programmable ASIC” | Intel Gaudi 2 (ASIC‑like, but programmable via XLA) | Intel | Cloud inference, training | 200 TOP/s (int8), 96 GB HBM2e, 600 W | Middle‑tier inference, cost‑efficient scaling. |
| | AMD Instinct MI200 | AMD | HPC + AI workloads | ~48 TFLOP/s (FP32 vector, MI250X), 128 GB HBM2e, 560 W | Scientific simulation with ML components. |
| | NVIDIA Ampere & Hopper | NVIDIA | Training & inference | 312 TFLOP/s (FP16) on A100, ~990 TFLOP/s (FP16) on H100 | General AI research & production. |
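
A quick way to read the "Key specs" column is to normalize peak throughput by board power. The sketch below is a rough illustration, not a benchmark: its inputs are the peak figures quoted in the table, which mix precisions (FP16 tensor vs. int8) and vendor measurement conditions, so the resulting ratios are only indicative.

```python
# Rough performance-per-watt comparison using the peak figures from the table above.
# The entries mix precisions (FP16 tensor vs. int8), so treat the ratios as indicative only.

accelerators = {
    # name: (peak throughput in TFLOP/s or TOP/s, precision, board power in watts)
    "NVIDIA A100":     (312.0,  "FP16", 400),
    "NVIDIA H100":     (990.0,  "FP16", 700),
    "AMD MI300X":      (1300.0, "FP16", 750),
    "Habana Gaudi 2":  (200.0,  "int8", 600),
    "NVIDIA RTX 4090": (83.0,   "FP16", 450),
}

for name, (peak, precision, watts) in accelerators.items():
    print(f"{name:17s} {peak / watts:5.2f} T(FL)OP/s per W ({precision})")
```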
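
A few entries above (TPU v4, Gaudi 2) are listed as programmable through XLA rather than a device‑specific API, which is what makes the same model code portable across them. The snippet below is a minimal JAX sketch of that idea, assuming a JAX installation with the relevant backend; the function and array shapes are arbitrary examples, not anything specific to the chips in the table.

```python
import jax
import jax.numpy as jnp


@jax.jit  # XLA traces and compiles this function for whatever backend is present
def matmul(a, b):
    return a @ b


# Arbitrary shapes; bfloat16 matches the mixed-precision formats these accelerators favor.
a = jnp.ones((1024, 1024), dtype=jnp.bfloat16)
b = jnp.ones((1024, 1024), dtype=jnp.bfloat16)

print(jax.devices())         # lists the devices XLA will target (TPU, GPU, or CPU)
print(matmul(a, b).shape)    # (1024, 1024)
```

The same pattern applies to TensorFlow models compiled through XLA: the compiler, not the model author, handles lowering onto the particular accelerator.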