Unicode decoding error
Raised when a Unicode-related decoding error occurs. This almost always happens when you try to read a sequence of bytes (e.g., from a file or network) and interpret it as text using the wrong encoding.
- Reading a file saved with one encoding (e.g., `latin-1`) while trying to decode it as another (e.g., `utf-8`).
- Receiving binary data from a network socket and trying to decode it as text without knowing the correct encoding.
- Reading corrupted data that contains invalid byte sequences for the specified encoding.
The error is raised when the UTF-8 codec is asked to decode a byte sequence that is not valid UTF-8.
# 0xff is not a valid start byte in UTF-8
byte_sequence = b'\xff'
try:
    byte_sequence.decode('utf-8')
except UnicodeDecodeError as e:
    print(f"Caught UnicodeDecodeError: {e}")
expected output
Caught UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
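The first cause in the list above, an encoding mismatch, can be reproduced without any file. The following is a minimal sketch; the sample string and variable names are illustrative and not taken from a real data source. It also reads the `encoding`, `start`, and `object` attributes that `UnicodeDecodeError` carries, which help locate the offending byte.

```python
# Illustrative sample: 'é' is the single byte 0xE9 in latin-1,
# but 0xE9 on its own is not a complete UTF-8 sequence.
raw = "café".encode("latin-1")  # b'caf\xe9'

failure = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    failure = e
    # The exception records which codec failed and where.
    print(f"codec={e.encoding}, position={e.start}, byte={e.object[e.start]:#x}")

# Decoding with the encoding the bytes were written in succeeds.
text = raw.decode("latin-1")
print(text == "café")
```

Inspecting `e.start` and `e.object` is often enough to tell whether the data is in a different encoding or just has a stray corrupted byte.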
Fix 1
Specify the correct encoding when opening files
WHEN You know the encoding of the file you are reading.
# If you know the file is encoded in latin-1
try:
    with open('my_file.txt', 'r', encoding='latin-1') as f:
        content = f.read()
except FileNotFoundError:
    print("File not found.")
Why this works
Explicitly providing the correct encoding to `open()` tells Python how to interpret the bytes, preventing decode errors.
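To see the effect end to end, the sketch below writes a small latin-1 file to a temporary directory and reads it back twice: once with the wrong encoding and once with the right one. The file name `sample.txt` and its contents are made up for the demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name

    # Write the file as latin-1, as a legacy tool might.
    with open(path, "w", encoding="latin-1") as f:
        f.write("naïve café")

    # Reading it as UTF-8 fails on the first non-ASCII byte.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        wrong_encoding_failed = False
    except UnicodeDecodeError:
        wrong_encoding_failed = True

    # Reading it with the encoding it was written in succeeds.
    with open(path, "r", encoding="latin-1") as f:
        content = f.read()

print(wrong_encoding_failed)
print(content == "naïve café")
```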
Fix 2
Provide an error handling strategy
WHEN A file might contain a few invalid characters that you can afford to ignore or replace.
byte_sequence = b'hello\xffworld'
# 'replace' will insert a placeholder for invalid bytes
text = byte_sequence.decode('utf-8', errors='replace')
print(text)
# 'ignore' will simply discard invalid bytes
text_ignored = byte_sequence.decode('utf-8', errors='ignore')
print(text_ignored)
Why this works
The `.decode()` method's `errors` parameter allows you to specify a policy for handling bytes that can't be decoded, such as replacing them (`'replace'`) or discarding them (`'ignore'`).
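Beyond `'replace'` and `'ignore'`, Python's standard error handlers also include `'backslashreplace'` and `'surrogateescape'`. The comparison below runs all four on sample bytes containing one invalid byte. It is a sketch for illustration, not a recommendation for any one handler.

```python
data = b"hello\xffworld"

replaced = data.decode("utf-8", errors="replace")          # U+FFFD placeholder
ignored = data.decode("utf-8", errors="ignore")            # bad byte dropped
escaped = data.decode("utf-8", errors="backslashreplace")  # literal \xff text
smuggled = data.decode("utf-8", errors="surrogateescape")  # lone surrogate U+DCFF

# ascii() keeps the output readable on any terminal encoding.
print(ascii(replaced))
print(ascii(ignored))
print(ascii(escaped))

# Only surrogateescape is lossless: encoding with the same handler
# gives back the original bytes.
round_trip = smuggled.encode("utf-8", errors="surrogateescape")
print(round_trip == data)
```

`'replace'` and `'ignore'` lose information permanently, so `'surrogateescape'` may be the better fit when the bytes need to be written back out unchanged.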
b"\xff".decode("utf-8")  # UnicodeDecodeError: invalid start byte

try:
    text = data.decode("utf-8")
except UnicodeDecodeError:
    text = data.decode("utf-8", errors="replace")

with open("file.txt", "r", encoding="utf-8", errors="replace") as f:
    content = f.read()  # never raises UnicodeDecodeError

✕ Guessing encodings one by one until something works
This is unreliable and can lead to silently corrupted text (mojibake). The correct solution is to find out the actual encoding of your data source.
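Guessing fails because some codecs never raise. `latin-1` maps every possible byte to a character, so a guess-until-it-works loop can "succeed" with the wrong answer. A small illustration with a made-up sample string:

```python
original = "café"
utf8_bytes = original.encode("utf-8")  # b'caf\xc3\xa9'

# latin-1 accepts any byte sequence, so this "works" without an error...
guessed = utf8_bytes.decode("latin-1")

# ...but the two UTF-8 bytes for 'é' come out as two separate characters.
print(guessed == original)          # False
print(len(original), len(guessed))  # 4 5

# Decoding with the real encoding recovers the text.
recovered = utf8_bytes.decode("utf-8")
print(recovered == original)        # True
```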
cpython/Objects/unicodeobject.c
The Absolute Minimum Every Developer Must Know About Unicode