character

A character is a unit of information. In simple terms, it is a letter, digit, punctuation mark, Chinese character, and so on.

The best definition of a character is a Unicode character.

Unicode is a global standard designed to represent the characters of all the world's languages. The identity of a Unicode character, its code point, is written as 4 to 6 hexadecimal digits with a U+ prefix, for example U+0041 for the letter A.
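
As a small illustration of the U+ notation, the sketch below formats code points with Python's built-in ord(). The helper name code_point_label is hypothetical and only serves this example.

```python
def code_point_label(ch: str) -> str:
    """Return the U+XXXX label of a single character (hypothetical helper)."""
    # ord() returns the integer code point; pad to at least 4 hex digits
    return f"U+{ord(ch):04X}"


# "A", the euro sign and an emoji, written with escapes to keep the source ASCII
labels = [code_point_label(ch) for ch in "A\u20ac\U0001F600"]
print(labels)  # ['U+0041', 'U+20AC', 'U+1F600']
```

The reverse direction is chr(), which turns a code point back into a character.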

byte

A byte is a unit of measurement for computer information. One byte is eight bits, so it can store values from 0 to 255.
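
The 0 to 255 range can be checked quickly in Python. This is only a sketch; it borrows the built-in bytes type, which is covered in more detail later in the article.

```python
# Eight bits give 2 ** 8 = 256 distinct values, so the largest is 255
largest_value = 2 ** 8 - 1
print(largest_value)  # 255

largest = bytes([largest_value])
print(largest)  # b'\xff'

# A value that does not fit into one byte is rejected
try:
    bytes([256])
except ValueError as exc:
    error = str(exc)
    print("rejected:", error)
```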

This is the "byte" in the name of the Internet company ByteDance.

Bytes are for machines; characters are for humans.

The algorithm that converts human characters into machine bytes is called an encoding; the reverse conversion, from bytes back to characters, is called decoding.

Different encodings map characters to bytes differently. For example, é is the two bytes \xc3\xa9 in UTF-8 but the single byte \xe9 in Latin-1.
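
To make the difference concrete, here is a minimal sketch that encodes the same text with three codecs from the standard library. The choice of codecs is only illustrative.

```python
text = "café"
encoded = {}

for codec in ("utf_8", "latin_1", "utf_16_le"):
    data = text.encode(codec)          # characters -> bytes
    round_trip = data.decode(codec)    # bytes -> characters
    encoded[codec] = data
    print(codec, data, len(data), round_trip == text)

# utf_8 b'caf\xc3\xa9' 5 True
# latin_1 b'caf\xe9' 4 True
# utf_16_le b'c\x00a\x00f\x00\xe9\x00' 8 True
```

The same four characters take 5, 4 or 8 bytes depending on the encoding, which is why bytes can only be decoded correctly when the encoding is known.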

bytes and bytearray

Bytes are stored in binary sequences. Python provides two types for this: the immutable bytes type and the mutable bytearray type. For example:

>>> cafe = bytes("café", encoding="utf_8")
>>> cafe
b'caf\xc3\xa9'
>>> cafe[0]
99
>>> cafe[:1]
b'c'
>>> cafe_arr = bytearray(cafe)
>>> cafe_arr
bytearray(b'caf\xc3\xa9')
>>> cafe_arr[-1:]
bytearray(b'\xa9')

Note that cafe[0] returns an integer while cafe[:1] returns a binary sequence. This is because s[0] == s[:1] holds only for the str type; for other sequence types, s[i] returns an element and s[i:i+1] returns a sequence of the same type.

A binary sequence is actually a sequence of integers, but its literal representation contains ASCII characters (ASCII can only represent the characters used in English), as in b'caf\xc3\xa9' for cafe. The specific rules are:
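
The difference between indexing and slicing can be checked side by side. This is a small sketch; the variable names are only illustrative.

```python
s = "café"
b = s.encode("utf_8")

# For str, an item and a one-item slice are both str and compare equal
print(s[0] == s[:1])  # True

# For bytes, an item is an int and a one-item slice is bytes
item, piece = b[0], b[:1]
print(item, piece)  # 99 b'c'
print(type(item).__name__, type(piece).__name__)  # int bytes

# A one-item slice of a bytearray is again a bytearray
print(type(bytearray(b)[:1]).__name__)  # bytearray
```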

For bytes in the printable ASCII range, from space to ~, the ASCII character itself is used
For tab, line feed, carriage return and backslash, the escape sequences \t, \n, \r and \\ are used
For every other byte value, a hexadecimal escape sequence is used, such as \x00 for the null byte
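
The three display rules can be seen in one bytes object. In this sketch the byte values are chosen by hand to hit each rule.

```python
# A, ~ and space are printable ASCII; 0x09, 0x0A, 0x0D and 0x5C have
# dedicated escapes; 0x00 and 0xE9 fall back to hexadecimal escapes
samples = bytes([0x41, 0x7E, 0x20, 0x09, 0x0A, 0x0D, 0x5C, 0x00, 0xE9])
shown = repr(samples)
print(shown)  # b'A~ \t\n\r\\\x00\xe9'
```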

There are several ways to construct bytes and bytearray objects:

A str object and an encoding keyword argument
An iterable that yields values from 0 to 255
An object that implements the buffer protocol, such as bytes, bytearray, memoryview, array.array
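
A minimal sketch of the three construction routes follows. The sample values are arbitrary, and the array uses type code "h" (signed short) purely as an example of an object that supports the buffer protocol.

```python
import array

# 1. A str plus an encoding
from_str = bytes("café", encoding="utf_8")

# 2. An iterable of integers from 0 to 255
from_iterable = bytes([99, 97, 102])

# 3. An object that implements the buffer protocol; its raw bytes are copied
numbers = array.array("h", [-2, -1, 0, 1, 2])
from_buffer = bytes(numbers)

print(from_str)       # b'caf\xc3\xa9'
print(from_iterable)  # b'caf'
print(len(from_buffer) == len(numbers) * numbers.itemsize)  # True

# bytearray accepts the same arguments and can be changed in place
mutable = bytearray(from_iterable)
mutable[0] = 67
print(mutable)  # bytearray(b'Caf')
```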

memoryview and struct

memoryview lets binary data structures share memory without copying it, and struct extracts structured information from binary sequences.

The following example extracts the width and height of a GIF image:

import struct

with open("filter.gif", "rb") as fp:
    img = memoryview(fp.read())

# The byte sequence is not copied here, because the memoryview is used
header = img[:10]
print(bytes(header))  # b'GIF89a+\x02\xe6\x00'

# < means little-endian byte order, 3s3s is two 3-byte sequences, HH is two 16-bit unsigned integers
# Type, Version, Width, Height
struct.unpack("<3s3sHH", header)  # (b'GIF', b'89a', 555, 230)

# Delete the reference and free the memory occupied by the memoryview instance
del header
del img
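
The example above needs a GIF file on disk. As a self-contained sketch of the same idea, the code below builds a 10-byte header in memory with struct.pack (an assumption made only so that the example runs without filter.gif) and shows that a memoryview slice shares memory with the original buffer.

```python
import struct

# Build a GIF-style header in memory instead of reading a file
raw = bytearray(struct.pack("<3s3sHH", b"GIF", b"89a", 555, 230))
print(bytes(raw))  # b'GIF89a+\x02\xe6\x00'

view = memoryview(raw)

# Slicing a memoryview does not copy: the slice still refers to raw
size_fields = view[6:10]
width, height = struct.unpack("<HH", size_fields)
print(width, height)  # 555 230

# Writing through the slice changes the underlying bytearray
struct.pack_into("<H", size_fields, 0, 800)
header = struct.unpack("<3s3sHH", raw)
print(header)  # (b'GIF', b'89a', 800, 230)
```

Because the write through size_fields is visible in raw, the slice is clearly a window onto the same memory rather than a copy.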

summary

This article introduced the concepts of characters and bytes and the relationship between them. A character corresponds to one or more bytes. Characters are for humans and bytes are for machines. Encoding converts human characters into machine bytes, and decoding does the reverse. The article then introduced the binary sequence types bytes and bytearray, and two tools for working with binary sequences, memoryview and struct.