Strings & Text Processing

📖 Concept

Strings in Python are immutable sequences of Unicode code points. In Python 3 the str type is Unicode and source files are read as UTF-8 by default, so strings can contain characters from any language. Understanding string methods, formatting, and encoding is essential for data processing, web development, and file I/O.
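A minimal sketch of what "immutable sequence of Unicode code points" means in practice. The sample value and variable names are chosen only for illustration.

```python
greeting = "Hello, 世界"

# Strings are sequences: indexing, slicing, and len() work per code point.
first = greeting[0]        # 'H'
tail = greeting[-2:]       # '世界'
length = len(greeting)     # 9

# Immutability: item assignment raises TypeError.
try:
    greeting[0] = "J"
except TypeError as exc:
    error = str(exc)

# "Changing" a string really means building a new one.
changed = "J" + greeting[1:]   # 'Jello, 世界'

# ord() and chr() convert between a character and its code point.
code_point = ord("世")     # 19990
```

The original greeting is untouched after all of this; only new string objects were created.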

String creation:

Single quotes: 'hello'
Double quotes: "hello" (identical to single)
Triple quotes: '''multi-line''' or """docstring"""
Raw strings: r"no escaping"
Bytes literals: b"binary data" (a bytes object, not a str)
f-strings: f"Hello, {name}!" (Python 3.6+)
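The literal forms listed above can be compared side by side. This is a small sketch; the Windows-style path and the name "Ada" are made-up sample values.

```python
single = 'hello'
double = "hello"
same = single == double        # True: the quote style does not affect the value

multi = """line one
line two"""
line_count = len(multi.splitlines())   # 2

# A raw string keeps backslashes literally, so nothing needs doubling.
escaped = "C:\\temp\\new"
raw = r"C:\temp\new"
paths_equal = escaped == raw   # True

# A bytes literal holds integers in range 0-255, not characters.
data = b"binary data"
first_byte = data[0]           # 98, the ASCII code of "b"

name = "Ada"                   # sample value for illustration
greeting = f"Hello, {name}!"   # 'Hello, Ada!'
```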

f-strings (formatted string literals) are the preferred way to format strings in modern Python. They are typically faster than .format() and % formatting, easier to read, and can embed arbitrary expressions.
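To make the comparison concrete, here is the same message built with all three formatting styles. The values are illustrative only.

```python
name = "Alice"
price = 1234.5

percent_style = "%s owes $%.2f" % (name, price)
format_style = "{} owes ${:.2f}".format(name, price)
f_string = f"{name} owes ${price:.2f}"
# All three produce: 'Alice owes $1234.50'

# An f-string can evaluate method calls and arithmetic inline.
summary = f"{name.upper()} ({len(name)} letters) owes ${price * 2:,.2f} when doubled"
# 'ALICE (5 letters) owes $2,469.00 when doubled'
```

In the f-string the values sit where they appear in the output, which is most of the readability gain.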

Common string patterns:

Validation: .isdigit(), .isalpha(), .isalnum(), .isspace()
Transformation: .upper(), .lower(), .title(), .strip(), .replace()
Searching: .find(), .index(), .startswith(), .endswith(), in
Splitting/Joining: .split(), .join(), .partition()
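The validation methods are not covered by the main code example, so here is a short sketch of how they behave. The helper describe() is a hypothetical name used only for this illustration.

```python
def describe(token: str) -> str:
    """Classify a token using the built-in str validation methods."""
    if token.isdigit():
        return "digits"
    if token.isalpha():
        return "letters"
    if token.isalnum():
        return "letters and digits"
    if token.isspace():
        return "whitespace"
    return "other"


results = {t: describe(t) for t in ["2024", "Python", "py3", " \t", "-5", "3.14", ""]}

# The `in` operator answers "is it there?"; find() answers "where?".
sentence = "find the needle"
has_needle = "needle" in sentence     # True
position = sentence.find("needle")    # 9
```

Note that "-5" and "3.14" come back as "other": isdigit() only accepts digit characters, so it is not a general numeric check. The empty string also fails every is* test.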

Regular expressions (re module) provide powerful pattern matching for complex text processing tasks.
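As one small illustration of what the re module adds over plain string methods, the sketch below uses named groups and fullmatch() to parse an ISO-style date. The pattern and the helper parse_date() are assumptions made for this example; the pattern checks only the shape of the string, not whether the date exists.

```python
import re

# Named groups make the extracted parts self-describing.
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")


def parse_date(text):
    """Return the date parts as ints, or None if text is not exactly YYYY-MM-DD."""
    match = DATE.fullmatch(text)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


parsed = parse_date("2024-01-15")      # {'year': 2024, 'month': 1, 'day': 15}
rejected = parse_date("2024-1-15")     # None: month needs two digits
embedded = DATE.search("due 2024-01-15 at noon")   # search() finds it inside text
```

fullmatch() requires the whole string to fit the pattern, while search() looks for the pattern anywhere in the string.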

💻 Code Example

# ============================================================
# String methods
# ============================================================
s = "  Hello, World!  "

# Whitespace handling
print(s.strip())       # "Hello, World!" (leading and trailing whitespace removed)
print(s.lstrip())      # "Hello, World!  " (left strip only)
print(s.rstrip())      # "  Hello, World!" (right strip only)

# Case transformation
print("hello".upper())         # "HELLO"
print("HELLO".lower())         # "hello"
print("hello world".title())   # "Hello World"
print("hello world".capitalize())  # "Hello world"
print("Hello".swapcase())      # "hELLO"

# Searching
text = "Python is awesome and Python is fun"
print(text.find("Python"))       # 0 (first occurrence index)
print(text.find("Python", 1))    # 22 (search starts at index 1)
print(text.find("Java"))         # -1 (not found)
print(text.count("Python"))      # 2

print(text.startswith("Python"))  # True
print(text.endswith("fun"))       # True

# Replacing
print(text.replace("Python", "JavaScript"))
# "JavaScript is awesome and JavaScript is fun"

print(text.replace("Python", "JS", 1))  # Replace only first occurrence
# "JS is awesome and Python is fun"

# Splitting and joining
csv_line = "Alice,30,NYC,Engineer"
parts = csv_line.split(",")    # ['Alice', '30', 'NYC', 'Engineer']
print(parts)

words = "  hello   world  foo  ".split()  # Split on any whitespace
print(words)  # ['hello', 'world', 'foo']

# Join: the opposite of split
print(", ".join(["Alice", "Bob", "Charlie"]))  # "Alice, Bob, Charlie"
print("\n".join(["line1", "line2", "line3"]))  # Multi-line string

# partition: split into 3 parts at first occurrence
email = "user@example.com"
user, at, domain = email.partition("@")
print(f"User: {user}, Domain: {domain}")  # User: user, Domain: example.com

# ============================================================
# f-strings (Python 3.6+): the modern way
# ============================================================
name = "Alice"
age = 30
balance = 1234567.891

# Basic interpolation
print(f"Name: {name}, Age: {age}")

# Expressions
print(f"Next year: {age + 1}")
print(f"Name length: {len(name)}")
print(f"Uppercase: {name.upper()}")

# Format specifiers
print(f"Currency: ${balance:,.2f}")      # $1,234,567.89
print(f"Percentage: {0.856:.1%}")        # 85.6%
print(f"Padded: {42:08d}")              # 00000042
print(f"Binary: {255:08b}")             # 11111111
print(f"Hex: {255:#06x}")               # 0x00ff
print(f"Scientific: {12345.6789:.2e}")   # 1.23e+04

# Alignment
print(f"{'left':<20}|")     # "left                |"
print(f"{'right':>20}|")    # "               right|"
print(f"{'center':^20}|")   # "       center       |"
print(f"{'padded':*^20}")   # "*******padded*******"

# Debugging with = (Python 3.8+)
x = 42
print(f"{x = }")          # "x = 42"
print(f"{x * 2 = }")      # "x * 2 = 84"
print(f"{name = !r}")     # "name = 'Alice'" (with repr)

# Multi-line f-strings
report = (
    f"User Report\n"
    f"{'='*40}\n"
    f"Name:    {name}\n"
    f"Age:     {age}\n"
    f"Balance: ${balance:,.2f}"
)
print(report)

# ============================================================
# Regular expressions basics
# ============================================================
import re

text = "Contact: alice@email.com or bob@company.org"

# Find all email addresses
emails = re.findall(r'[\w.]+@[\w.]+\.\w+', text)
print(emails)  # ['alice@email.com', 'bob@company.org']

# Search (first match)
match = re.search(r'(\w+)@(\w+)\.\w+', text)
if match:
    print(f"Full match: {match.group()}")    # alice@email.com
    print(f"Username: {match.group(1)}")     # alice
    print(f"Domain: {match.group(2)}")       # email

# Substitution
cleaned = re.sub(r'\d{3}-\d{4}', 'XXX-XXXX', "Call 555-1234 or 555-5678")
print(cleaned)  # "Call XXX-XXXX or XXX-XXXX"

# Compiled patterns (faster for repeated use)
email_pattern = re.compile(r'[\w.]+@[\w.]+\.\w+')
all_emails = email_pattern.findall(text)

# Common patterns
phone = re.compile(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b')
url = re.compile(r'https?://[\w./\-?=&]+')
ip = re.compile(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')

# ============================================================
# String encoding
# ============================================================
# str → bytes (encode)
text = "Hello, 世界! 🌍"
encoded = text.encode('utf-8')
print(encoded)  # b'Hello, \xe4\xb8\x96\xe7\x95\x8c! \xf0\x9f\x8c\x8d'

# bytes → str (decode)
decoded = encoded.decode('utf-8')
print(decoded)  # "Hello, 世界! 🌍"

# Length difference: str counts characters, bytes counts bytes
print(len(text))     # 12 characters
print(len(encoded))  # 19 bytes (each CJK char = 3 bytes, emoji = 4 bytes)
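The encoding section above shows the happy path. As a hedged follow-up, this sketch shows what can happen when bytes are decoded with the wrong codec, and how the errors argument of decode() changes the outcome. The sample word is arbitrary.

```python
original = "caf\u00e9"                 # "café", written with an escape for clarity
payload = original.encode("utf-8")     # b'caf\xc3\xa9' (é takes two bytes)

# Decoding with a codec that cannot represent the bytes raises an error.
try:
    payload.decode("ascii")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# errors="replace" substitutes U+FFFD for each undecodable byte.
lossy = payload.decode("ascii", errors="replace")   # 'caf\ufffd\ufffd'

# A wrong but permissive codec "succeeds" and silently yields mojibake.
mojibake = payload.decode("latin-1")   # 'cafÃ©'

# Only the matching codec round-trips cleanly.
round_trip = payload.decode("utf-8")
```

The latin-1 case is the one to watch for: nothing raises, and the corrupted text only shows up later.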

🏋️ Practice Exercise

Exercises:

Write a function is_palindrome(s) that checks if a string is a palindrome, ignoring case, spaces, and punctuation. Example: "A man, a plan, a canal: Panama" → True.
Implement a Caesar cipher: write encrypt(text, shift) and decrypt(text, shift) functions. Handle uppercase, lowercase, and non-alpha characters.
Parse a log file line like "2024-01-15 14:30:22 [ERROR] Database connection timeout" and extract the date, time, level, and message using both .split() and regex.
Create a format_table(headers, rows) function that displays data in a formatted ASCII table with column alignment. Use f-string formatting.
Write a function that converts between cases: snake_case → camelCase → PascalCase → kebab-case. Handle edge cases.
Use regex to validate: email addresses, phone numbers (various formats), and URLs. Write comprehensive test cases for each.

⚠️ Common Mistakes

Using + for string concatenation in loops: each concatenation can create a new string object, so building a long string this way can take O(n^2) time. Collect the pieces in a list and call ''.join(pieces) instead.
Forgetting that strings are immutable: s.upper() returns a NEW string and does not modify s. You must assign the result: s = s.upper().
Using .format() or % formatting where an f-string would do (Python 3.6+). f-strings are typically faster and more readable.
Not using raw strings for regex patterns: re.search('\d+', text) should be re.search(r'\d+', text). Without r, Python processes backslash escapes before the regex engine sees the pattern, so '\b' becomes a backspace character rather than a word boundary, and unrecognized escapes such as '\d' trigger a warning in recent Python versions.
Confusing str.find() (returns -1 if not found) with str.index() (raises ValueError if not found). Use find() when absence is expected, index() when absence is an error.
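The mistakes above can be reproduced in a few lines. This is an illustrative sketch; the variable names and sample values are invented for the demonstration.

```python
words = ["alpha", "beta", "gamma"]

# Concatenation: build the pieces, then join once.
joined = " ".join(word.upper() for word in words)   # 'ALPHA BETA GAMMA'

# Immutability: the method result must be assigned to be kept.
s = "hello"
s.upper()            # result discarded, s is still 'hello'
unchanged = s
s = s.upper()        # now s is 'HELLO'

# Raw strings: without r, "\b" is a single backspace character.
plain_len = len("\b")    # 1
raw_len = len(r"\b")     # 2 (a backslash followed by 'b')

# find() vs index(): one returns -1, the other raises ValueError.
text = "key=value"
missing = text.find(";")     # -1
try:
    text.index(";")
    index_raised = False
except ValueError:
    index_raised = True
```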
