Original article: https://seiya.me/blog/reading-linux-v0.01
Author: Seiya Nuta
Translated with DeepL (rough machine translation)

Exploring the internals of Linux v0.01

The Linux kernel is often cited as an overwhelmingly large piece of open source software. As of this writing, the latest version is v6.5-rc5, which consists of 36M lines of code. Needless to say, Linux is the fruit of decades of hard work by many contributors.

However, the first version of Linux, v0.01, was pretty small. It consisted of only 10,239 lines of code. Excluding comments and blank lines, it was only 8,670 lines. It’s small enough to understand and is a good starting point for learning about the internals of UNIX-like operating system kernels.

Reading v0.01 was really fun for me. It was like visiting the Computer History Museum in Mountain View - I finally saw for myself that the tales are indeed true! I wrote this post to share this exciting experience with you. Let’s dive in!

Disclaimer: Obviously I’m not the author of Linux v0.01. If you find any mistakes in this post, please let me know!

What do system calls look like?

v0.01 has 66 system calls. Here’s the list:

access acct alarm break brk chdir chmod
chown chroot close creat dup dup2 execve
exit fcntl fork fstat ftime getegid geteuid
getgid getpgrp setsid getpid getppid
getuid gtty ioctl kill link lock lseek
mkdir mknod mount mpx nice open pause
phys pipe prof ptrace read rename rmdir
setgid setpgid setuid setup signal stat
stime stty sync time times ulimit umask
umount uname unlink ustat utime waitpid write
  • It supports reading, writing, creating, and deleting files and directories. Other fundamental concepts like chmod(2) (permission), chown(2) (owner), and pipe(2) (inter-process communication) are also supported.

  • fork(2) and execve(2) were there. Only the a.out executable format was supported.

  • The concept of sockets was not implemented. Thus, no network support.

  • Some features like mount(2) were not implemented. They just return ENOSYS:

int sys_mount()
{
	return -ENOSYS;
}

Deeply hardcoded for the Intel 386 architecture

Linus had a very famous debate with Andrew S. Tanenbaum, the author of MINIX, about operating system design: monolithic kernel vs. microkernel, which is the better design?

Tanenbaum pointed out that Linux is (or was) not portable because it was deeply hardcoded for the Intel 386 (i386):

MINIX was designed to be reasonably portable, and has been ported from the Intel line to the 680x0 (Atari, Amiga, Macintosh), SPARC, and NS32016. LINUX is tied fairly closely to the 80x86. Not the way to go.

It’s indeed true. Linux v0.01 was deeply hardcoded for i386. Here’s the implementation of strcpy in include/string.h:

extern inline char * strcpy(char * dest,const char *src)
{
__asm__("cld\n"
	"1:\tlodsb\n\t"
	"stosb\n\t"
	"testb %%al,%%al\n\t"
	"jne 1b"
	::"S" (src),"D" (dest):"si","di","ax");
return dest;
}

It’s written in assembly using the i386 string instructions. Yes, a similar optimized strcpy can still be found in today’s Linux, but here it sits in include/string.h, not somewhere like include/i386/string.h. Moreover, there is no #ifdef to switch the implementation for different architectures. It’s simply hardcoded for the Intel 386.
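For contrast, here is a minimal sketch of what an architecture-neutral version of the same loop might look like. It is not taken from the v0.01 source; the name portable_strcpy is hypothetical and chosen only for illustration.

```c
/* Hypothetical portable equivalent of the v0.01 strcpy (not from the
 * kernel source): copy bytes from src to dest up to and including the
 * terminating NUL, then return dest. */
char *portable_strcpy(char *dest, const char *src)
{
	char *d = dest;

	/* Same shape as the lodsb/stosb/testb/jne loop:
	 * load a byte, store it, stop once the stored byte was NUL. */
	while ((*d++ = *src++) != '\0')
		;
	return dest;
}
```

A version like this compiles for any architecture, which is exactly the property the hand-written i386 assembly gives up.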

Also, only PC/AT devices were supported:

As you may have noticed, they’re not in a drivers directory as in today’s Linux. They’re hardcoded in the core subsystems.

“FREAX”

I’ve read somewhere that Linus originally named his kernel “FREAX”. The Makefile in Linux v0.01 still had the following comment:

# Makefile for the FREAX-kernel.

It was indeed FREAX!

Which file system did v0.01 support?

Today, Linux supports a variety of file systems such as ext4, Btrfs, and XFS. What about v0.01? ext2? Nope, here’s a hint from include/linux/fs.h:

#define SUPER_MAGIC 0x137F

The answer is, as GPT-4 correctly guessed, the MINIX file system!
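To make the role of the magic number concrete, here is a rough sketch of how a tool might recognize a MINIX super block from raw bytes. Only the value 0x137F comes from the source above. The function name looks_like_minix, the byte offset of 16 (based on my understanding of the classic MINIX v1 layout, where six 16-bit fields and one 32-bit field precede the magic), and the little-endian encoding are assumptions.

```c
#include <stdint.h>

#define MINIX_SUPER_MAGIC 0x137F  /* same value as SUPER_MAGIC in v0.01 */
#define MAGIC_OFFSET 16           /* ASSUMPTION: offset of the magic field */

/* Hypothetical helper: return 1 if the raw super block bytes carry the
 * MINIX magic number (read as little-endian), 0 otherwise.
 * sb must point to at least MAGIC_OFFSET + 2 bytes. */
int looks_like_minix(const uint8_t *sb)
{
	uint16_t magic = (uint16_t)(sb[MAGIC_OFFSET] |
				    (sb[MAGIC_OFFSET + 1] << 8));
	return magic == MINIX_SUPER_MAGIC;
}
```

A caller would first read the super block from disk into a buffer and then pass that buffer to the helper; where on disk the super block lives is outside the scope of this sketch.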

Fun fact: ext (“extended file system”), the predecessor of ext2/ext3/ext4, was inspired by the MINIX file system.

There “probably” won’t be any reason to change the scheduler

Here’s the scheduler of Linux v0.01:

while (1) {
		c = -1;
		next = 0;
		i = NR_TASKS;
		p = &task[NR_TASKS];
		while (--i) {
			if (!*--p)
				continue;
			if ((*p)->state == TASK_RUNNING && (*p)->counter > c)
				c = (*p)->counter, next = i;
		}
		if (c) break;
		for(p = &LAST_TASK ; p > &FIRST_TASK ; --p)
			if (*p)
				(*p)->counter = ((*p)->counter >> 1) +
						(*p)->priority;
	}
	switch_to(next);

i and p hold the task’s index in the task table (not the PID!) and a pointer to its task_struct pointer, respectively. The key variable is counter in task_struct ((*p)->counter). The scheduler picks the runnable task with the largest counter value and switches to it. If all runnable tasks have a counter value of 0, it updates each task’s counter by counter = (counter >> 1) + priority and restarts the loop. Note that counter >> 1 is a faster way to divide by 2.
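The selection logic can be tried outside the kernel with a small user-space sketch. Everything here is a simplification for illustration: struct mock_task, pick_next, the table size of 4, and the TASK_INTERRUPTIBLE value are hypothetical stand-ins, not the kernel’s definitions, and the sketch assumes every task has a priority greater than 0 (otherwise the refill step could loop forever).

```c
#include <stddef.h>

#define NR_TASKS 4            /* hypothetical small table for the sketch */
#define TASK_RUNNING 0
#define TASK_INTERRUPTIBLE 1  /* stands in for any non-runnable state */

/* Hypothetical cut-down task_struct with only the scheduling fields. */
struct mock_task {
	int state;
	int counter;
	int priority;
};

/* Return the index of the task to run next; 0 (the idle task slot)
 * when no task is runnable. Mirrors the loop shown above. */
int pick_next(struct mock_task *task[NR_TASKS])
{
	for (;;) {
		int c = -1, next = 0;

		/* Scan slots NR_TASKS-1 .. 1 for the runnable task
		 * with the largest counter. */
		for (int i = NR_TASKS - 1; i > 0; i--) {
			struct mock_task *p = task[i];
			if (p && p->state == TASK_RUNNING && p->counter > c) {
				c = p->counter;
				next = i;
			}
		}
		if (c != 0)
			return next; /* a task has ticks left, or none is runnable */

		/* All runnable tasks are out of ticks: refill every task,
		 * including the sleeping ones. */
		for (int i = NR_TASKS - 1; i > 0; i--)
			if (task[i])
				task[i]->counter = (task[i]->counter >> 1) +
						   task[i]->priority;
	}
}
```

With two runnable tasks that have both used up their counters, the first pass finds a maximum of 0, the refill runs once, and the second pass picks whichever task now has the larger counter.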

The key point is the counter update: it also updates the counter of non-runnable tasks. This means that if a task waits for I/O for a long time and its priority is higher than 2, its counter increases monotonically with each update until it reaches an upper bound. This is just my guess, but I think this is meant to prioritize rarely-runnable but latency-sensitive tasks like the shell, which spends most of its life waiting for keyboard input.
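The upper bound is easy to see by iterating the update rule on its own. The helper below is a hypothetical sketch, not kernel code, and the priority of 15 used in the note that follows is only an example value.

```c
/* Hypothetical helper: apply the v0.01 refill rule n times to the
 * counter of a task that never gets to run in between. */
int counter_after_refills(int counter, int priority, int n)
{
	while (n-- > 0)
		counter = (counter >> 1) + priority;
	return counter;
}
```

Starting from a counter of 0 with priority 15, successive refills give 15, 22, 26, 28, 29, and then the value stays at 29, so a sleeping task’s counter levels off just below twice its priority.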

Lastly, switch_to(next) is a macro that switches the CPU context to the picked task. It’s well described here. In short, it was based on an x86-specific feature called the Task State Segment (TSS), which is no longer used for task management in the x86-64 architecture.

By the way, there’s an interesting comment about the scheduler:

 * 'schedule()' is the scheduler function. This is GOOD CODE! There
 * probably won't be any reason to change this, as it should work well
 * in all circumstances (ie gives IO-bound processes good response etc).

Yes, it’s indeed good code. Unfortunately (or fortunately), this prophecy proved false. Linux became one of the most practical and performant kernels, and it has introduced many scheduling improvements and new algorithms over the years, such as the Completely Fair Scheduler (CFS).

Kernel panic in 5 lines

volatile void panic(const char * s)
{
	printk("Kernel panic: %s\n\r",s);
	for(;;);
}

Let the user know something went wrong, and hang the system. Period.

fork(2) in kernel space?

The main portion of kernel initialization can be found in init/main.c (fun fact: this file still exists in today’s Linux kernel and still initializes the kernel):

void main(void)		/* This really IS void, no error here. */
{			/* The startup routine assumes (well, ...) this */
/*
 * Interrupts are still disabled. Do necessary setups, then
 * enable them
 */
	time_init();
	tty_init();
	trap_init();
	sched_init();
	buffer_init();
	hd_init();
	sti();
	move_to_user_mode();
	if (!fork()) {		/* we count on this going ok */
		init();
	}
/*
 *   NOTE!!   For any other task 'pause()' would mean we have to get a
 * signal to awaken, but task0 is the sole exception (see 'schedule()')
 * as task 0 gets activated at every idle moment (when no other tasks
 * can run). For task0 'pause()' just means we go check if some other
 * task can run, and if not we return here.
 */
	for(;;) pause();
}

void init(void)
{
	int i,j;

	setup();
	if (!fork())
		_exit(execve("/bin/update",NULL,NULL));
	(void) open("/dev/tty0",O_RDWR,0);
	(void) dup(0);
	(void) dup(0);
	printf("%d buffers = %d bytes buffer space\n\r",NR_BUFFERS,
		NR_BUFFERS*BLOCK_SIZE);
	printf(" Ok.\n\r");
	if ((i=fork())<0)
		printf("Fork failed in init\r\n");
	else if (!i) {
		close(0);close(1);close(2);
		setsid();
		(void) open("/dev/tty0",O_RDWR,0);
		(void) dup(0);
		(void) dup(0);
		_exit(execve("/bin/sh",argv,envp));
	}
	j=wait(&i);
	printf("child %d died with code %04x\n",j,i);
	sync();
	_exit(0);	/* NOTE! _exit, not exit() */
}

It calls each subsystem’s initialization function. Pretty straightforward. But there’s something interesting: it calls fork(2) in the kernel’s main(). Also, init() looks like an ordinary user-space implementation, but it’s hardcoded in the kernel code!

It looks as if it’s fork(2)-ing in kernel space, but it actually isn’t. The trick is in move_to_user_mode():

#define move_to_user_mode() \
__asm__ ("movl %%esp,%%eax\n\t"    /* EAX = current stack pointer */ \
	"pushl $0x17\n\t"          /* SS (user data seg) */ \
	"pushl %%eax\n\t"          /* ESP */ \
	"pushfl\n\t"               /* EFLAGS */ \
	"pushl $0x0f\n\t"          /* CS (user code seg) */ \
	"pushl $1f\n\t"            /* EIP (return address) */ \
	"iret\n"                   /* switch to user mode */ \
	"1:\tmovl $0x17,%%eax\n\t" /* IRET returns to this address */ \
	"movw %%ax,%%ds\n\t"       /* Set DS to user data segment */ \
	"movw %%ax,%%es\n\t"       /* Set ES to user data segment */ \
	"movw %%ax,%%fs\n\t"       /* Set FS to user data segment */ \
	"movw %%ax,%%gs"           /* Set GS to user data segment */ \
	:::"ax")                   /* No RET instruction here: continue executing the following lines! */

You don’t need to fully understand the assembly code above. What it does is switch to user mode using the IRET instruction, but then continue executing the following lines of kernel code with the current stack pointer! Thus, the following if (!fork()) is executed in user mode, and fork(2) is actually a system call.

Linus didn’t have a machine with 8MB RAM

 * For those with more memory than 8 Mb - tough luck. I've
 * not got it, why should you :-) The source is here. Change
 * it. (Seriously - it shouldn't be too difficult. ...

Today, machines with 8GB of RAM are very common. Furthermore, 8GB is not nearly enough for software engineers ;)

Hard to compile with modern toolchains

Lastly, I tried to compile the kernel with modern toolchains but failed to do so. I thought GCC (or C itself) had good backward compatibility, but it isn’t sufficient. Even the older standard -std=gnu90 produced compile errors that are not trivial to fix.

One fun fact is that Linus used his own GCC with a feature named -mstring-insns:

# If you don't have '-mstring-insns' in your gcc (and nobody but me has :-)
# remove them from the CFLAGS defines.

I’m not sure what it is, but it seems to be a feature to support (or optimize?) x86 string instructions.

If you manage to compile the kernel with modern toolchains, write an article and send me a link :D

Read it yourself!

I hope you enjoyed reading the source code of Linux v0.01 as much as I did. If you’re interested in v0.01, download the tarball from kernel.org. Reading the code is not so hard, especially if you’ve read xv6 before. Linux v0.01 is minimalistic but very well written.

  • written by Seiya Nuta