Revisit the English version (#1835)

* Review the English version using Claude-4.5.

* Update mkdocs.yml

* Align the section titles.

* Bug fixes
This commit is contained in:
Yudong Jin
2025-12-30 17:54:01 +08:00
committed by GitHub
parent 091afd38b4
commit 45e1295241
106 changed files with 4195 additions and 3398 deletions

View File

@@ -2,65 +2,65 @@
### Key review
- Data structures can be categorized from two perspectives: logical structure and physical structure. Logical structure describes the logical relationships between data, while physical structure describes how data is stored in memory.
- Frequently used logical structures include linear structures, trees, and networks. We usually divide data structures into linear (arrays, linked lists, stacks, queues) and non-linear (trees, graphs, heaps) based on their logical structure. The implementation of hash tables may involve both linear and non-linear data structures.
- When a program is running, data is stored in memory. Each memory space has a corresponding address, and the program accesses data through these addresses.
- Physical structures can be divided into continuous space storage (arrays) and discrete space storage (linked lists). All data structures are implemented using arrays, linked lists, or a combination of both.
- The basic data types in computers include integers (`byte`, `short`, `int`, `long`), floating-point numbers (`float`, `double`), characters (`char`), and booleans (`bool`). The value range of a data type depends on its size and representation.
- Sign-magnitude, 1's complement, 2's complement are three methods of encoding integers in computers, and they can be converted into each other. The most significant bit of the sign-magnitude is the sign bit, and the remaining bits represent the value of the number.
- Integers are encoded by 2's complement in computers. The benefits of this representation include (i) the computer can unify the addition of positive and negative integers, (ii) no need to design special hardware circuits for subtraction, and (iii) no ambiguity of positive and negative zero.
- The encoding of floating-point numbers consists of 1 sign bit, 8 exponent bits, and 23 fraction bits. Due to the exponent bit, the range of floating-point numbers is much greater than that of integers, but at the cost of precision.
- ASCII is the earliest English character set, with 1 byte in length and a total of 127 characters. GBK is a popular Chinese character set, which includes more than 20,000 Chinese characters. Unicode aims to provide a complete character set standard that includes characters from various languages in the world, thus solving the garbled character problem caused by inconsistent character encoding methods.
- UTF-8 is the most popular and general Unicode encoding method. It is a variable-length encoding method with good scalability and space efficiency. UTF-16 and UTF-32 are fixed-length encoding methods. When encoding Chinese characters, UTF-16 takes up less space than UTF-8. Programming languages like Java and C# use UTF-16 encoding by default.
- Data structures can be classified from two perspectives: logical structure and physical structure. Logical structure describes the logical relationships between data elements, while physical structure describes how data is stored in computer memory.
- Common logical structures include linear, tree, and network structures. We typically classify data structures as linear (arrays, linked lists, stacks, queues) and non-linear (trees, graphs, heaps) based on their logical structure. The implementation of hash tables may involve both linear and non-linear data structures.
- When a program runs, data is stored in computer memory. Each memory space has a corresponding memory address, and the program accesses data through these memory addresses.
- Physical structures are primarily divided into contiguous space storage (arrays) and dispersed space storage (linked lists). All data structures are implemented using arrays, linked lists, or a combination of both.
- Basic data types in computers include integers `byte`, `short`, `int`, `long`, floating-point numbers `float`, `double`, characters `char`, and booleans `bool`. Their value ranges depend on the size of space they occupy and their representation method.
- Sign-magnitude, 1's complement, and 2's complement are three methods for encoding numbers in computers, and they can be converted into each other. The most significant bit of sign-magnitude is the sign bit, and the remaining bits represent the value of the number.
- Integers are stored in computers in 2's complement form. Under 2's complement representation, computers can treat the addition of positive and negative numbers uniformly, without needing to design special hardware circuits for subtraction, and there is no ambiguity of positive and negative zero.
- The encoding of floating-point numbers consists of 1 sign bit, 8 exponent bits, and 23 fraction bits. Due to the exponent bits, the range of floating-point numbers is much larger than that of integers, at the cost of sacrificing precision.
- ASCII is the earliest English character set, with a length of 1 byte, containing a total of 127 characters. GBK is a commonly used Chinese character set, containing over 20,000 Chinese characters. Unicode is committed to providing a complete character set standard, collecting characters from various languages around the world, thereby solving the garbled text problem caused by inconsistent character encoding methods.
- UTF-8 is the most popular Unicode encoding method, with excellent universality. It is a variable-length encoding method with good scalability, effectively improving storage space efficiency. UTF-16 and UTF-32 are fixed-length encoding methods. When encoding Chinese characters, UTF-16 occupies less space than UTF-8. Programming languages such as Java and C# use UTF-16 encoding by default.
### Q & A
**Q**: Why does a hash table contain both linear and non-linear data structures?
**Q**: Why do hash tables contain both linear and non-linear data structures?
The underlying structure of a hash table is an array. To resolve hash collisions, we may use "chaining" (discussed in a later section, "Hash collision"): each bucket in the array points to a linked list, which may transform into a tree (usually a red-black tree) when its length is larger than a certain threshold.
From a storage perspective, the underlying structure of a hash table is an array, where each bucket might contain a value, a linked list, or a tree. Therefore, hash tables may contain both linear data structures (arrays, linked lists) and non-linear data structures (trees).
The underlying structure of a hash table is an array. To resolve hash collisions, we may use "chaining" (discussed in the subsequent "Hash Collision" section): each bucket in the array points to a linked list, which may be converted to a tree (usually a red-black tree) when the list length exceeds a certain threshold.
From a storage perspective, the underlying structure of a hash table is an array, where each bucket slot may contain a value, a linked list, or a tree. Therefore, hash tables may contain both linear data structures (arrays, linked lists) and non-linear data structures (trees).
**Q**: Is the length of the `char` type 1 byte?
The length of the `char` type is determined by the encoding method of the programming language. For example, Java, JavaScript, TypeScript, and C# all use UTF-16 encoding (to save Unicode code points), so the length of the `char` type is 2 bytes.
The length of the `char` type is determined by the encoding method used by the programming language. For example, Java, JavaScript, TypeScript, and C# all use UTF-16 encoding (to store Unicode code points), so the `char` type has a length of 2 bytes.
**Q**: Is there any ambiguity when we refer to array-based data structures as "static data structures"? The stack can also perform "dynamic" operations such as popping and pushing.
**Q**: Is there ambiguity in referring to array-based data structures as "static data structures"? Stacks can also perform "dynamic" operations such as push and pop.
The stack can implement dynamic data operations, but the data structure is still "static" (the length is fixed). Although array-based data structures can dynamically add or remove elements, their capacity is fixed. If the stack size exceeds the pre-allocated size, then the old array will be copied into a newly created and larger array.
Stacks can indeed implement dynamic data operations, but the data structure is still "static" (fixed length). Although array-based data structures can dynamically add or remove elements, their capacity is fixed. If the data volume exceeds the pre-allocated size, a new larger array needs to be created, and the contents of the old array must be copied to the new array.
**Q**: When building a stack (queue), its size is not specified, so why are they "static data structures"?
**Q**: When constructing a stack (queue), its size is not specified. Why are they "static data structures"?
In high-level programming languages, we do not need to manually specify the initial capacity of stacks (queues); this task is automatically completed within the class. For example, the initial capacity of Java's `ArrayList` is usually 10. Furthermore, the expansion operation is also completed automatically. See the subsequent "List" chapter for details.
In high-level programming languages, we do not need to manually specify the initial capacity of a stack (queue); this work is automatically completed within the class. For example, the initial capacity of Java's `ArrayList` is typically 10. Additionally, the expansion operation is also automatically implemented. See the subsequent "List" section for details.
**Q**The method of converting the sign-magnitude to the 2's complement is "first negate and then add 1", so converting the 2's complement to the sign-magnitude should be its inverse operation "first subtract 1 and then negate".
However, the 2's complement can also be converted to the sign-magnitude through "first negate and then add 1", why is this?
**Q**: The method of converting sign-magnitude to 2's complement is "first negate then add 1". So converting 2's complement to sign-magnitude should be the inverse operation "first subtract 1 then negate". However, 2's complement can also be converted to sign-magnitude through "first negate then add 1". Why is this?
**A**This is because the mutual conversion between the sign-magnitude and the 2's complement is equivalent to computing the "complement". We first define the complement: assuming $a + b = c$, then we say that $a$ is the complement of $b$ to $c$, and vice versa, $b$ is the complement of $a$ to $c$.
This is because the mutual conversion between sign-magnitude and 2's complement is actually the process of computing the "complement". Let us first define the complement: assuming $a + b = c$, then we say that $a$ is the complement of $b$ to $c$, and conversely, $b$ is the complement of $a$ to $c$.
Given a binary number $0010$ with length $n = 4$, if this number is the sign-magnitude (ignoring the sign bit), then its 2's complement can be obtained by "first negating and then adding 1":
Given an $n = 4$ bit binary number $0010$, if we treat this number as sign-magnitude (ignoring the sign bit), then its 2's complement can be obtained through "first negate then add 1":
$$
0010 \rightarrow 1101 \rightarrow 1110
$$
Observe that the sum of the sign-magnitude and the 2's complement is $0010 + 1110 = 10000$, i.e., the 2's complement $1110$ is the "complement" of the sign-magnitude $0010$ to $10000$. **This means that the above "first negate and then add 1" is equivalent to computing the complement to $10000$**.
We find that the sum of sign-magnitude and 2's complement is $0010 + 1110 = 10000$, which means the 2's complement $1110$ is the "complement" of sign-magnitude $0010$ to $10000$. **This means the above "first negate then add 1" is actually the process of computing the complement to $10000$**.
So, what is the "complement" of $1110$ to $10000$? We can still compute it by "negating first and then adding 1":
So, what is the "complement" of 2's complement $1110$ to $10000$? We can still use "first negate then add 1" to obtain it:
$$
1110 \rightarrow 0001 \rightarrow 0010
$$
In other words, the sign-magnitude and the 2's complement are each other's "complement" to $10000$, so "sign-magnitude to 2's complement" and "2's complement to sign-magnitude" can be implemented with the same operation (first negate and then add 1).
In other words, sign-magnitude and 2's complement are each other's "complement" to $10000$, so "sign-magnitude to 2's complement" and "2's complement to sign-magnitude" can be implemented using the same operation (first negate then add 1).
Of course, we can also use the inverse operation of "first negate and then add 1" to find the sign-magnitude of the 2's complement $1110$, that is, "first subtract 1 and then negate":
Of course, we can also use the inverse operation to find the sign-magnitude of 2's complement $1110$, that is, "first subtract 1 then negate":
$$
1110 \rightarrow 1101 \rightarrow 0010
$$
To sum up, "first negate and then add 1" and "first subtract 1 and then negate" are both computing the complement to $10000$, and they are equivalent.
In summary, both "first negate then add 1" and "first subtract 1 then negate" are computing the complement to $10000$, and they are equivalent.
Essentially, the "negate" operation is actually to find the complement to $1111$ (because `sign-magnitude + 1's complement = 1111` always holds); and the 1's complement plus 1 is equal to the 2's complement to $10000$.
Essentially, the "negate" operation is actually finding the complement to $1111$ (because `sign-magnitude + 1's complement = 1111` always holds); and adding 1 to the 1's complement yields the 2's complement, which is the complement to $10000$.
We take $n = 4$ as an example in the above, and it can be generalized to any binary number with any number of digits.
The above uses $n = 4$ as an example, and it can be generalized to binary numbers of any number of bits.