mirror of
https://github.com/krahets/hello-algo.git
synced 2026-04-13 15:29:53 +08:00
deploy
@@ -3645,24 +3645,24 @@
<!-- Page content -->
<h1 id="34-character-encoding">3.4 Character encoding *<a class="headerlink" href="#34-character-encoding" title="Permanent link">¶</a></h1>
<p>In a computer system, all data is stored in binary form, and characters (represented by char) are no exception. To represent characters, we need a "character set" that defines a one-to-one mapping between characters and binary numbers. With a character set, a computer can convert binary numbers to characters by looking them up in the table.</p>
<p>In a computer system, all data is stored in binary form, and <code>char</code> is no exception. To represent characters, we need a "character set" that defines a one-to-one mapping between characters and binary numbers. With a character set, a computer can convert binary numbers to characters by looking them up in the table.</p>
<h2 id="341-ascii-character-set">3.4.1 ASCII character set<a class="headerlink" href="#341-ascii-character-set" title="Permanent link">¶</a></h2>
<p>The "ASCII code", officially known as the American Standard Code for Information Interchange, is one of the earliest character sets. It uses 7 binary digits (the lower 7 bits of a byte) to represent a character, allowing for at most 128 different characters. As shown in Figure 3-6, ASCII includes uppercase and lowercase English letters, the digits 0 to 9, various punctuation marks, and certain control characters (such as newline and tab).</p>
<p>The <u>ASCII code</u>, officially known as the American Standard Code for Information Interchange, is one of the earliest character sets. It uses 7 binary digits (the lower 7 bits of a byte) to represent a character, allowing for at most 128 different characters. As shown in Figure 3-6, ASCII includes uppercase and lowercase English letters, the digits 0 to 9, various punctuation marks, and certain control characters (such as newline and tab).</p>
<p><a class="glightbox" href="../character_encoding.assets/ascii_table.png" data-type="image" data-width="100%" data-height="auto" data-desc-position="bottom"><img alt="ASCII code" class="animation-figure" src="../character_encoding.assets/ascii_table.png" /></a></p>
<p align="center"> Figure 3-6 ASCII code </p>
<p>However, <strong>ASCII can only represent English characters</strong>. With the globalization of computers, a character set called "EASCII" was developed to represent more languages. It expands from the 7-bit structure of ASCII to 8 bits, enabling the representation of 256 characters.</p>
<p>However, <strong>ASCII can only represent English characters</strong>. With the globalization of computers, a character set called <u>EASCII</u> was developed to represent more languages. It expands from the 7-bit structure of ASCII to 8 bits, enabling the representation of 256 characters.</p>
<p>Around the world, various region-specific EASCII character sets were introduced. The first 128 characters of each set are identical to ASCII, while the remaining 128 are defined differently to meet the needs of different languages.</p>
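<p>As a rough illustration of the table lookup and of region-specific 8-bit sets, the sketch below uses Python's built-in <code>ord</code> together with two standard 8-bit codecs. Treating <code>latin-1</code> and <code>cp1251</code> as examples of EASCII-style code pages is an assumption made here for illustration; the text above does not name specific code pages.</p>

```python
# ASCII: every character maps to a number below 128, so 7 bits suffice.
for ch in "Az09\n":
    code = ord(ch)
    assert code < 128
    print(repr(ch), code, format(code, "07b"))

# The lower 128 values decode identically under both code pages
# (assumed here as stand-ins for region-specific EASCII sets).
shared = bytes([0x41])
ascii_same = shared.decode("latin-1") == shared.decode("cp1251") == "A"

# The same upper-half byte maps to different characters per code page.
upper = bytes([0xE9])
as_latin1 = upper.decode("latin-1")  # Western European code page
as_cp1251 = upper.decode("cp1251")   # Cyrillic code page
print(ascii_same, as_latin1, as_cp1251)
```

<p>The last line shows why text written under one 8-bit code page can look garbled when read under another: the byte is the same, but the lookup table differs.</p>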
<h2 id="342-gbk-character-set">3.4.2 GBK character set<a class="headerlink" href="#342-gbk-character-set" title="Permanent link">¶</a></h2>
<p>Later, it was found that <strong>EASCII still could not meet the character requirements of many languages</strong>. For instance, there are nearly a hundred thousand Chinese characters, several thousand of which are in regular use. In 1980, China's national standards bureau released the "GB2312" character set, which included 6763 Chinese characters and essentially met the needs of processing Chinese on computers.</p>
<p>However, GB2312 could not handle some rare and traditional characters. The "GBK" character set extends GB2312 and includes 21886 characters and symbols. In the GBK encoding scheme, ASCII characters are represented with one byte, while Chinese characters use two bytes.</p>
<p>Later, it was found that <strong>EASCII still could not meet the character requirements of many languages</strong>. For instance, there are nearly a hundred thousand Chinese characters, several thousand of which are in regular use. In 1980, China's national standards bureau released the <u>GB2312</u> character set, which included 6763 Chinese characters and essentially met the needs of processing Chinese on computers.</p>
<p>However, GB2312 could not handle some rare and traditional characters. The <u>GBK</u> character set extends GB2312 and includes 21886 characters and symbols. In the GBK encoding scheme, ASCII characters are represented with one byte, while Chinese characters use two bytes.</p>
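<p>The one-byte and two-byte behavior of GBK can be checked with a minimal sketch using Python's standard <code>gbk</code> codec. The helper name <code>gbk_length</code> is hypothetical and introduced only for this example.</p>

```python
def gbk_length(ch: str) -> int:
    """Return how many bytes a single character occupies in GBK (hypothetical helper)."""
    return len(ch.encode("gbk"))

ascii_len = gbk_length("A")   # ASCII character
hanzi_len = gbk_length("算")  # Chinese character
print(ascii_len, hanzi_len)

# ASCII characters keep their original single-byte value under GBK.
ascii_unchanged = "A".encode("gbk") == b"A"

# A mixed string round-trips through GBK without loss.
text = "Hello 算法"
encoded = text.encode("gbk")
round_trip = encoded.decode("gbk")
print(len(encoded), round_trip)
```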
<h2 id="343-unicode-character-set">3.4.3 Unicode character set<a class="headerlink" href="#343-unicode-character-set" title="Permanent link">¶</a></h2>
<p>With the rapid evolution of computer technology and a plethora of character sets and encoding standards, numerous problems arose. On the one hand, these character sets generally only defined characters for specific languages and could not function properly in multilingual environments. On the other hand, the existence of multiple character set standards for the same language caused garbled text when information was exchanged between computers using different encoding standards.</p>
<p>Researchers of that era thought: <strong>What if a comprehensive character set encompassing all global languages and symbols was developed? Wouldn't this resolve the issues associated with cross-linguistic environments and garbled text?</strong> Inspired by this idea, the extensive character set, Unicode, was born.</p>
<p>"Unicode", known as "统一码" (Unified Code) in Chinese, can theoretically accommodate over a million characters. It aims to incorporate characters from all over the world into a single set, providing a universal character set for processing and displaying various languages and reducing the garbled text caused by differing encoding standards.</p>
<p><u>Unicode</u>, known as "统一码" (Unified Code) in Chinese, can theoretically accommodate over a million characters. It aims to incorporate characters from all over the world into a single set, providing a universal character set for processing and displaying various languages and reducing the garbled text caused by differing encoding standards.</p>
<p>Since its release in 1991, Unicode has continually expanded to include new languages and characters. As of September 2022, Unicode contains 149,186 characters, covering the characters and symbols of various languages and even emojis. In this vast character set, commonly used characters occupy 2 bytes, while some rare characters may occupy 3 or even 4 bytes.</p>
<p>Unicode is a universal character set that assigns a number (called a "code point") to each character, <strong>but it does not specify how these character code points should be stored in a computer system</strong>. One might ask: How does a system interpret Unicode code points of varying lengths within a text? For example, given a 2-byte code, how does the system determine if it represents a single 2-byte character or two 1-byte characters?</p>
<p>A straightforward solution to this problem is to store all characters as equal-length encodings. As shown in Figure 3-7, each character in "Hello" occupies 1 byte, while each character in "算法" (algorithm) occupies 2 bytes. We could encode all characters in "Hello 算法" as 2 bytes by padding the higher bits with zeros. This method would enable the system to interpret a character every 2 bytes, recovering the content of the phrase.</p>
<p><strong>A straightforward solution to this problem is to store all characters as equal-length encodings</strong>. As shown in Figure 3-7, each character in "Hello" occupies 1 byte, while each character in "算法" (algorithm) occupies 2 bytes. We could encode all characters in "Hello 算法" as 2 bytes by padding the higher bits with zeros. This method would enable the system to interpret a character every 2 bytes, recovering the content of the phrase.</p>
<p><a class="glightbox" href="../character_encoding.assets/unicode_hello_algo.png" data-type="image" data-width="100%" data-height="auto" data-desc-position="bottom"><img alt="Unicode encoding example" class="animation-figure" src="../character_encoding.assets/unicode_hello_algo.png" /></a></p>
<p align="center"> Figure 3-7 Unicode encoding example </p>
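<p>The equal-length idea can be sketched in a few lines of Python. The function names <code>encode_fixed</code> and <code>decode_fixed</code> are hypothetical, and the choice of big-endian byte order is an assumption made for this example; the text only states that the higher bits are padded with zeros.</p>

```python
def encode_fixed(text: str, width: int = 2) -> bytes:
    """Store every code point in exactly `width` bytes, zero-padding the high bits."""
    out = bytearray()
    for ch in text:
        # to_bytes raises OverflowError if the code point does not fit in `width` bytes.
        out += ord(ch).to_bytes(width, "big")
    return bytes(out)

def decode_fixed(data: bytes, width: int = 2) -> str:
    """Read one character from every `width` bytes."""
    chars = []
    for i in range(0, len(data), width):
        chars.append(chr(int.from_bytes(data[i:i + width], "big")))
    return "".join(chars)

text = "Hello 算法"
encoded = encode_fixed(text)
decoded = decode_fixed(encoded)
print(len(text), len(encoded), decoded)
```

<p>Because every character takes the same number of bytes, the decoder never has to guess where one character ends and the next begins. The cost is that each ASCII character now carries a zero byte it does not need.</p>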
@@ -3895,7 +3895,7 @@ aria-label="Footer"
<div class="md-copyright">
<div class="md-copyright__highlight">
Copyright © 2022-2024 krahets<br>The website content is licensed under <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC BY-NC-SA 4.0</a>
Copyright © 2024 krahets<br>The website content is licensed under <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC BY-NC-SA 4.0</a>
</div>