Chapter 2 Architecture

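This chapter introduces some basics of CPU and GPU architecture. You can skip or skim it at first, but to fully understand everything later you will need to come back for the details.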

CPU Architecture

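What is a CPU? Where is it?

The CPU, or central processing unit, is the brain of a modern computer. It is usually a chip on the motherboard. If you have ever built a desktop workstation, you will know what I mean.

Figure 2.1 A photo of a CPU (I found it on Google; I don't know which model it is).

Anything that can be considered a computer should have a CPU: consoles, desktops, laptops, phones... because the CPU's job is to interpret and execute the instructions in a program.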

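What are the tasks?

Let's look at this in the context of game programming. A CPU is usually designed to handle the general-purpose tasks in a game, for example:

Game logic: physics, script execution...
Scheduling: coordinating I/O with the GPU, memory, keyboard, network, disk...
System services: the operating system runs on the CPU, and it also decides which resources the game can use.

In any case, as the name says, it is central and it does the processing.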

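How are the tasks done?

In short, in four steps.

Fetch: get the instruction from memory.
Decode: work out what the instruction means.
Execute: run the instruction.
Write back: store the result.

Now you may wonder: what is an instruction? When did I ever issue one?

Exactly: you are the programmer, and you write the code. Code eventually becomes machine code, a sequence of binary bits. When you write in a compiled language such as C or C++, the compiler translates your code into assembly and then into machine code, which is loaded into memory when the program runs. (Languages such as C#, Java, and Python take a less direct route through bytecode, a JIT compiler, or an interpreter, but the CPU still ends up executing machine instructions.) For example, a few x86 instructions might look like this: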

mov eax, 3

mov ebx, 4

add eax, ebx

These instructions correspond roughly to the following C code:

int a = 3;

int b = 4;

int c = a + b;

The corresponding machine code might look like this:

B8 03 00 00 00

BB 04 00 00 00

01 D8

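So the CPU executes it just as I described above:

Fetch: the CPU fetches the instruction (for example, B8 03 00 00 00 on the first line).
Decode: the CPU decodes B8 and understands it as mov eax, imm32.
Execute: the CPU takes the immediate 03 00 00 00 (the little-endian encoding of 3) as the value to move.
Write back: the CPU stores the result in the register, so eax = 3.

Then it moves on to the next instruction.

For every instruction sent to the CPU, these steps repeat over and over.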
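To make the loop concrete, here is a minimal sketch of a toy interpreter in C that walks through the three machine-code instructions above. It is not how a real CPU or emulator is built: the names ToyCpu, toy_run, and example_program are made up for illustration, and it only understands the three encodings used in this chapter.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy register file: only the two registers used in the example. */
typedef struct {
    uint32_t eax;
    uint32_t ebx;
} ToyCpu;

/* The machine code from the text. */
const uint8_t example_program[] = {
    0xB8, 0x03, 0x00, 0x00, 0x00,  /* mov eax, 3   */
    0xBB, 0x04, 0x00, 0x00, 0x00,  /* mov ebx, 4   */
    0x01, 0xD8                     /* add eax, ebx */
};

/* Read a 32-bit little-endian immediate starting at p. */
static uint32_t read_imm32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Run until the end of the buffer.
   Returns 0 on success, -1 on an encoding this toy does not know. */
int toy_run(ToyCpu *cpu, const uint8_t *code, size_t len) {
    size_t pc = 0;  /* program counter: offset of the next instruction */
    while (pc < len) {
        uint8_t opcode = code[pc];                       /* fetch */
        if (opcode == 0xB8 && len - pc >= 5) {           /* decode: mov eax, imm32 */
            uint32_t value = read_imm32(code + pc + 1);  /* execute */
            cpu->eax = value;                            /* write back */
            pc += 5;
        } else if (opcode == 0xBB && len - pc >= 5) {    /* decode: mov ebx, imm32 */
            uint32_t value = read_imm32(code + pc + 1);  /* execute */
            cpu->ebx = value;                            /* write back */
            pc += 5;
        } else if (opcode == 0x01 && len - pc >= 2 &&
                   code[pc + 1] == 0xD8) {               /* decode: add eax, ebx */
            uint32_t sum = cpu->eax + cpu->ebx;          /* execute */
            cpu->eax = sum;                              /* write back */
            pc += 2;
        } else {
            return -1;                                   /* unknown encoding */
        }
    }
    return 0;
}
```

Calling toy_run on example_program with both registers starting at zero leaves eax holding 7 and ebx holding 4, which matches c = a + b in the C code.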

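Architecture

Now comes the most important (and perhaps the most complicated) part: the CPU's hardware architecture. I don't want to go too deep here, because all we want is to make things faster. To understand where speed differences come from, you need to know how the data flows.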

Figure 2.2 A simplified diagram showing CPU architecture.

The diagram above shows a typical 4-core CPU architecture. Modern CPUs usually have multiple cores. Each core may have an L1 data cache (L1D), an L1 instruction cache (L1I), and an L2 cache, and all cores share a common L3 cache. As the diagram shows, the L3 cache is usually the largest and L1 the smallest; on the other hand, L1 is usually the fastest and L3 the slowest.

A commonly asked question is: why do we want three levels of cache?

There are multiple reasons.

First, caches are built from SRAM. A larger array has longer word lines and bit lines, so each access has to drive more capacitance. That costs more energy and increases the RC (resistance-capacitance) delay. In other words, the larger a cache is, the slower it tends to be.

Second, a high-performance core usually needs to issue multiple loads and stores per clock cycle. A CPU running at 3 GHz goes through 3 billion clock cycles per second, so one cycle lasts about 0.33 ns. The cache closest to the core has to answer very quickly; otherwise the core sits idle waiting for data instead of doing work.

Therefore, each core has two small but fast L1 caches (L1D and L1I) and a slightly slower L2 cache, backed by a larger but slower L3 cache shared by all cores. The shared cache also lets different cores exchange data.
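One way to see why the hierarchy pays off is to estimate the average cost of a memory access. The sketch below uses the usual average-memory-access-time recurrence (each level costs its own hit latency plus its miss rate times the cost of the level below). The struct CacheLevel and the functions amat_cycles and cycles_to_ns are hypothetical names, and any latencies or miss rates you plug in are assumptions, not measurements of a real CPU.

```c
/* One cache level, described by two assumed numbers. */
typedef struct {
    double hit_cycles;  /* latency when the data is found at this level */
    double miss_rate;   /* fraction of accesses reaching this level that miss */
} CacheLevel;

/* Average memory access time in cycles:
   cost = hit_1 + miss_1 * (hit_2 + miss_2 * (... + miss_n * dram_cycles)) */
double amat_cycles(const CacheLevel *levels, int count, double dram_cycles) {
    double cost = dram_cycles;  /* start from the slowest level */
    for (int i = count - 1; i >= 0; --i) {
        cost = levels[i].hit_cycles + levels[i].miss_rate * cost;
    }
    return cost;
}

/* Convert cycles to nanoseconds for a clock given in GHz
   (a 3 GHz clock runs 3 cycles per nanosecond). */
double cycles_to_ns(double cycles, double ghz) {
    return cycles / ghz;
}
```

As an illustration with made-up numbers: with hit latencies of 4, 12, and 40 cycles, miss rates of 10%, 50%, and 50%, and a 200-cycle DRAM access, the three-level hierarchy averages 12.2 cycles per access, while the same L1 backed directly by DRAM averages 24 cycles.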

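What is special about modern CPUs?

As tech enthusiasts, we spend a lot of money picking a powerful CPU when we buy a new computer. That only makes sense because different CPU models really do differ in performance.

Stronger parallelism

Modern CPUs usually have multiple cores that can execute tasks at the same time. More powerful CPUs have more cores available for computation.

Out-of-order execution

Earlier CPUs could handle only 3-4 micro-operations (μops) at once, and they had to be scheduled in program order. More powerful CPUs can handle more μops, and may execute them in a different order. For example, take these simple statements: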

a = b + c;

d = e + f;

g = h + i;

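On an early CPU, these statements would be executed one by one. Because none of them depends on the result of another, a modern CPU may dispatch them to different execution units inside the same core and run them at the same time.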
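The same idea shows up in ordinary loops. The sketch below sums an array in two ways: with a single accumulator, where every addition has to wait for the previous one, and with four accumulators, which gives the core four independent chains it can overlap. The function names are made up for illustration, and whether the second version is actually faster depends on the CPU and on the compiler, which may already unroll or vectorize the first loop.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: each add depends on the result of the previous add. */
uint64_t sum_single_chain(const uint32_t *data, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}

/* Four accumulators: four independent dependency chains
   that an out-of-order core can work on at the same time. */
uint64_t sum_four_chains(const uint32_t *data, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    for (; i < n; ++i) {  /* leftover elements when n is not a multiple of 4 */
        s0 += data[i];
    }
    return s0 + s1 + s2 + s3;
}
```

Both functions return the same sum; only the shape of the dependencies between the additions differs.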

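Larger caches, lower latency

Based on what we covered above, a cache should be as large as possible as long as it can stay fast. Modern CPUs have larger caches at every level. On the other hand, when a cache miss happens (which is more common with smaller caches), the data has to be fetched from DRAM, and that costs on the order of hundreds of clock cycles.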
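Access order is the part of this that a programmer controls. The sketch below sums the same matrix twice: once walking memory sequentially, and once jumping a whole row ahead between reads. On a matrix much larger than the caches, the second pattern typically causes far more cache misses, although the exact difference depends on the hardware. The function names are hypothetical, and the matrix is assumed to be stored row by row in one flat array.

```c
#include <stddef.h>

/* Visit elements in the order they sit in memory (cache friendly). */
long sum_row_major(const int *m, size_t rows, size_t cols) {
    long total = 0;
    for (size_t r = 0; r < rows; ++r) {
        for (size_t c = 0; c < cols; ++c) {
            total += m[r * cols + c];
        }
    }
    return total;
}

/* Visit column by column: consecutive reads are cols elements apart,
   so each read may touch a different cache line. */
long sum_column_major(const int *m, size_t rows, size_t cols) {
    long total = 0;
    for (size_t c = 0; c < cols; ++c) {
        for (size_t r = 0; r < rows; ++r) {
            total += m[r * cols + c];
        }
    }
    return total;
}
```

The two functions always return the same value; timing them on a large matrix is a simple way to observe the cost of cache misses on your own machine.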

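Smarter branch prediction

When the CPU handles branching code (for example if/else or for loops), it usually predicts the outcome ahead of time. If the prediction is wrong, tens of clock cycles are wasted. Modern, powerful CPUs predict more accurately.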
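A common way to sidestep an unpredictable branch is to turn the condition into arithmetic. The sketch below counts the elements at or above a threshold in two ways; the function names are made up for illustration. Treat it as a sketch of the idea only: an optimizing compiler may well turn the first version into branch-free code on its own, and on data where the branch is easy to predict the two versions can perform the same.

```c
#include <stddef.h>

/* Branchy version: the CPU has to guess the outcome of the if
   for every element. */
size_t count_at_least_branchy(const int *data, size_t n, int threshold) {
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (data[i] >= threshold) {
            ++count;
        }
    }
    return count;
}

/* Branchless version: the comparison yields 0 or 1, which is added
   directly, so the loop body has no data-dependent branch in the source. */
size_t count_at_least_branchless(const int *data, size_t n, int threshold) {
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        count += (size_t)(data[i] >= threshold);
    }
    return count;
}
```

Both functions return the same count for any input.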

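Of course, there is a lot more to learn about CPUs, and you may be interested in the architecture of some well-known modern CPUs, such as the Apple M4 Max or the Intel Core i9-14900KF. You should definitely dig into the technical details to see why they are so good. Otherwise, this is enough for now; we may go deeper when we look at CPU bottlenecks or mobile optimization.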

GPU Architecture

CPU vs. GPU: What is different?

Architecture