Strings & Text Processing


📖 Concept

Strings in Python are immutable sequences of Unicode code points. Python 3 reads source files as UTF-8 by default and str is Unicode throughout, so strings can contain characters from any language. Understanding string methods, formatting, and encoding is essential for data processing, web development, and file I/O.

String creation:

  • Single quotes: 'hello'
  • Double quotes: "hello" (identical to single)
  • Triple quotes: '''multi-line''' or """docstring"""
  • Raw strings: r"no escaping"
  • Byte strings: b"binary data"
  • f-strings: f"Hello, {name}!" (Python 3.6+)
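The literal forms listed above can be compared side by side. This is a small sketch; the variable names and the value "Ada" are illustrative assumptions, not part of the original lesson.

```python
# Each literal form below produces an ordinary str, except the b"" form.
single = 'hello'
double = "hello"
assert single == double              # quote style does not affect the value

multi = """line one
line two"""                          # triple quotes keep the embedded newline

raw = r"C:\new\table"                # backslashes stay literal in a raw string
cooked = "C:\\new\\table"            # same text written with escaped backslashes

data = b"binary data"                # bytes, not str: a sequence of integers 0-255

name = "Ada"                         # hypothetical value for illustration
greeting = f"Hello, {name}!"

print(multi.count("\n"))             # 1
print(raw == cooked)                 # True
print(len(raw))                      # 12
print(type(data).__name__, data[0])  # bytes 98
print(greeting)                      # Hello, Ada!
```

Indexing a bytes object yields an integer (98 is the code for "b"), which is one quick way to tell bytes and str apart.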

f-strings (formatted string literals) are the modern, preferred way to format strings in most code. They are typically faster than .format() and % formatting, more readable, and accept arbitrary expressions inside the braces.
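To make the comparison concrete, here is the same message built with all three formatting styles. The values of name and score are hypothetical.

```python
name, score = "Ada", 91.456   # hypothetical values

percent_style = "%s scored %.1f" % (name, score)
format_style = "{} scored {:.1f}".format(name, score)
fstring_style = f"{name} scored {score:.1f}"

print(fstring_style)                                   # Ada scored 91.5
print(percent_style == format_style == fstring_style)  # True

# Arbitrary expressions are allowed inside the braces.
print(f"{name.upper()} passed: {score >= 50}")         # ADA passed: True
```

All three produce identical text; the f-string keeps each value next to the place it appears, which is most of the readability gain.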

Common string patterns:

  • Validation: .isdigit(), .isalpha(), .isalnum(), .isspace()
  • Transformation: .upper(), .lower(), .title(), .strip(), .replace()
  • Searching: .find(), .index(), .startswith(), .endswith(), in
  • Splitting/Joining: .split(), .join(), .partition()
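A short sketch of how these method families combine in practice. The helper normalize_username is a hypothetical name introduced only for illustration.

```python
def normalize_username(raw):
    """Hypothetical helper: trim, lowercase, and validate a username."""
    cleaned = raw.strip().lower()
    # isalnum() is False for the empty string, so blank input is rejected too.
    if not cleaned.isalnum():
        return None
    return cleaned

print(normalize_username("  Alice42 "))   # alice42
print(normalize_username("bob smith"))    # None

# isdigit() accepts some characters that int() rejects, such as superscript two.
print("\u00b2".isdigit(), "\u00b2".isdecimal())   # True False

# partition() always returns three parts, even when the separator is absent.
print("key=value".partition("="))         # ('key', '=', 'value')
print("novalue".partition("="))           # ('novalue', '', '')

# "in" answers yes/no; find() also reports where.
sentence = "text processing in Python"
print("Python" in sentence, sentence.find("Python"))   # True 19
```

When the goal is to check that int() will succeed, isdecimal() is usually the closer match than isdigit().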

Regular expressions (re module) provide powerful pattern matching for complex text processing tasks.
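As one illustration of what the re module adds over plain string methods, the sketch below pulls key=value pairs out of a line using named groups. The input format and the name pair_pattern are assumptions made for the example.

```python
import re

# Hypothetical input format: space-separated key=value pairs.
line = "order=1042 status=shipped carrier=ups"

pair_pattern = re.compile(r"(?P<key>\w+)=(?P<value>\w+)")

fields = {m.group("key"): m.group("value") for m in pair_pattern.finditer(line)}
print(fields)   # {'order': '1042', 'status': 'shipped', 'carrier': 'ups'}

# fullmatch() succeeds only if the whole string fits the pattern.
print(bool(re.fullmatch(r"\d{4}", "1042")))    # True
print(bool(re.fullmatch(r"\d{4}", "1042x")))   # False
```

Named groups keep the extraction readable as patterns grow, and fullmatch() is a convenient choice for validation because it cannot succeed on a partial match.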

💻 Code Example

# ============================================================
# String methods
# ============================================================
s = "  Hello, World!  "

# Whitespace handling
print(s.strip())   # "Hello, World!" (leading and trailing whitespace removed)
print(s.lstrip())  # "Hello, World!  " (left strip only)
print(s.rstrip())  # "  Hello, World!" (right strip only)

# Case transformation
print("hello".upper())             # "HELLO"
print("HELLO".lower())             # "hello"
print("hello world".title())       # "Hello World"
print("hello world".capitalize())  # "Hello world"
print("Hello".swapcase())          # "hELLO"

# Searching
text = "Python is awesome and Python is fun"
print(text.find("Python"))     # 0 (index of the first occurrence)
print(text.find("Python", 1))  # 22 (search starts at index 1)
print(text.find("Java"))       # -1 (not found)
print(text.count("Python"))    # 2

print(text.startswith("Python"))  # True
print(text.endswith("fun"))       # True

# Replacing
print(text.replace("Python", "JavaScript"))
# "JavaScript is awesome and JavaScript is fun"

print(text.replace("Python", "JS", 1))  # Replace only the first occurrence
# "JS is awesome and Python is fun"

# Splitting and joining
csv_line = "Alice,30,NYC,Engineer"
parts = csv_line.split(",")
print(parts)  # ['Alice', '30', 'NYC', 'Engineer']

words = "  hello   world  foo  ".split()  # No argument: split on runs of whitespace
print(words)  # ['hello', 'world', 'foo']

# join: the inverse of split
print(", ".join(["Alice", "Bob", "Charlie"]))  # "Alice, Bob, Charlie"
print("\n".join(["line1", "line2", "line3"]))  # Multi-line string

# partition: split into 3 parts at the first occurrence of the separator
email = "user@example.com"
user, at, domain = email.partition("@")
print(f"User: {user}, Domain: {domain}")  # User: user, Domain: example.com

# ============================================================
# f-strings (Python 3.6+): the modern way
# ============================================================
name = "Alice"
age = 30
balance = 1234567.891

# Basic interpolation
print(f"Name: {name}, Age: {age}")

# Expressions
print(f"Next year: {age + 1}")
print(f"Name length: {len(name)}")
print(f"Uppercase: {name.upper()}")

# Format specifiers (comments show the formatted value)
print(f"Currency: ${balance:,.2f}")      # $1,234,567.89
print(f"Percentage: {0.856:.1%}")        # 85.6%
print(f"Padded: {42:08d}")               # 00000042
print(f"Binary: {255:08b}")              # 11111111
print(f"Hex: {255:#06x}")                # 0x00ff
print(f"Scientific: {12345.6789:.2e}")   # 1.23e+04

# Alignment within a 20-character field
print(f"{'left':<20}|")     # "left                |"
print(f"{'right':>20}|")    # "               right|"
print(f"{'center':^20}|")   # "       center       |"
print(f"{'padded':*^20}")   # "*******padded*******"

# Debugging with = (Python 3.8+)
x = 42
print(f"{x = }")         # "x = 42"
print(f"{x * 2 = }")     # "x * 2 = 84"
print(f"{name = !r}")    # "name = 'Alice'" (with repr)

# Multi-line f-strings: adjacent literals inside parentheses are concatenated
report = (
    f"User Report\n"
    f"{'='*40}\n"
    f"Name: {name}\n"
    f"Age: {age}\n"
    f"Balance: ${balance:,.2f}"
)
print(report)

# ============================================================
# Regular expressions basics
# ============================================================
import re

text = "Contact: alice@email.com or bob@company.org"

# Find all email addresses
emails = re.findall(r'[\w.]+@[\w.]+\.\w+', text)
print(emails)  # ['alice@email.com', 'bob@company.org']

# Search (first match)
match = re.search(r'(\w+)@(\w+)\.\w+', text)
if match:
    print(f"Full match: {match.group()}")  # alice@email.com
    print(f"Username: {match.group(1)}")   # alice
    print(f"Domain: {match.group(2)}")     # email (first domain label only)

# Substitution
cleaned = re.sub(r'\d{3}-\d{4}', 'XXX-XXXX', "Call 555-1234 or 555-5678")
print(cleaned)  # "Call XXX-XXXX or XXX-XXXX"

# Compiled patterns (reusable; saves the pattern lookup on repeated use)
email_pattern = re.compile(r'[\w.]+@[\w.]+\.\w+')
all_emails = email_pattern.findall(text)

# Common patterns (illustrative, not strict validators:
# the IP pattern, for example, also accepts 999.999.999.999)
phone = re.compile(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b')
url = re.compile(r'https?://[\w./\-?=&]+')
ip = re.compile(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')

# ============================================================
# String encoding
# ============================================================
# str → bytes (encode)
text = "Hello, 世界! 🌍"
encoded = text.encode('utf-8')
print(encoded)  # b'Hello, \xe4\xb8\x96\xe7\x95\x8c! \xf0\x9f\x8c\x8d'

# bytes → str (decode)
decoded = encoded.decode('utf-8')
print(decoded)  # "Hello, 世界! 🌍"

# Length difference: str counts code points, bytes counts bytes
print(len(text))     # 12 characters
print(len(encoded))  # 19 bytes (each CJK character = 3 bytes, the emoji = 4 bytes)
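Encoding and decoding can also fail or silently garble text when the codec does not fit the data. The sketch below shows the standard errors= argument of str.encode() and what happens when UTF-8 bytes are decoded with the wrong codec. The sample word is an arbitrary choice.

```python
text = "café"

utf8 = text.encode("utf-8")
print(len(text), len(utf8))        # 4 5  ("é" takes two bytes in UTF-8)

# Encoding to ASCII fails for non-ASCII characters unless errors= says otherwise.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print(type(exc).__name__)      # UnicodeEncodeError

replaced = text.encode("ascii", errors="replace")
ignored = text.encode("ascii", errors="ignore")
print(replaced)                    # b'caf?'
print(ignored)                     # b'caf'

# Decoding with the wrong codec does not always raise; it can silently garble.
garbled = utf8.decode("latin-1")
print(garbled)                     # cafÃ©
```

The last line is the usual origin of stray "Ã©"-style sequences in scraped or exported text: UTF-8 bytes read back as a single-byte encoding.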

🏋️ Practice Exercise

Exercises:

  1. Write a function is_palindrome(s) that checks if a string is a palindrome, ignoring case, spaces, and punctuation. Example: "A man, a plan, a canal: Panama" → True.

  2. Implement a Caesar cipher: write encrypt(text, shift) and decrypt(text, shift) functions. Handle uppercase, lowercase, and non-alpha characters.

  3. Parse a log file line like "2024-01-15 14:30:22 [ERROR] Database connection timeout" and extract the date, time, level, and message using both .split() and regex.

  4. Create a format_table(headers, rows) function that displays data in a formatted ASCII table with column alignment. Use f-string formatting.

  5. Write functions that convert between naming conventions: snake_case → camelCase → PascalCase → kebab-case. Handle edge cases such as empty strings, single words, and consecutive separators.

  6. Use regex to validate: email addresses, phone numbers (various formats), and URLs. Write comprehensive test cases for each.

⚠️ Common Mistakes

  • Using + for string concatenation in loops: each concatenation can build a new string object, so the total work can grow as O(n^2). Collect the pieces in a list and call ''.join(pieces) instead.

  • Forgetting that strings are immutable: s.upper() returns a new string and does not modify s. Reassign the result: s = s.upper().

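  • Using .format() or % formatting when f-strings are available (Python 3.6+). f-strings are typically faster and more readable. The older styles still fit when the template is stored and filled in later, as with logging format strings.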

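  • Not using raw strings for regex patterns: re.search('\d+', text) should be re.search(r'\d+', text). Without the r prefix, Python processes backslash escapes before the regex engine sees the pattern, so '\b' becomes a backspace character rather than a word boundary, and an unrecognized escape such as '\d' triggers a warning in recent Python versions.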

  • Confusing str.find() (returns -1 if not found) with str.index() (raises ValueError if not found). Use find() when absence is expected, index() when absence is an error.
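Several of the mistakes above can be demonstrated in a few lines. This is a sketch with hypothetical sample data, not a benchmark.

```python
import re

# 1. Build strings with join() instead of += inside a loop.
words = ["alpha", "beta", "gamma"]   # hypothetical data
joined = ", ".join(words)
print(joined)                        # alpha, beta, gamma

# 2. String methods return new strings; the original is unchanged.
s = "hello"
s.upper()                            # result discarded
print(s)                             # hello
s = s.upper()
print(s)                             # HELLO

# 3. Raw strings keep backslashes literal, which is what the regex engine needs.
print(len("\b"), len(r"\b"))         # 1 2  ("\b" alone is a backspace character)
whole_words = re.findall(r"\bcat\b", "cat concat cat")
print(whole_words)                   # ['cat', 'cat']

# 4. find() returns -1 on a miss; index() raises ValueError.
miss = "abc".find("z")
print(miss)                          # -1
try:
    "abc".index("z")
except ValueError:
    print("not found")               # not found
```

The find() result of -1 is a valid index in slicing, so forgetting to check for it tends to produce wrong output rather than an error.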
