Strings & Text Processing
Concept
Strings in Python are immutable sequences of Unicode characters (code points). In Python 3, str holds Unicode text and source files are read as UTF-8 by default, so strings can contain characters from any language. Understanding string methods, formatting, and encoding is essential for data processing, web development, and file I/O.
String creation:
- Single quotes: 'hello'
- Double quotes: "hello" (identical to single)
- Triple quotes: '''multi-line''' or """docstring"""
- Raw strings: r"no escaping"
- Byte strings: b"binary data"
- f-strings: f"Hello, {name}!" (Python 3.6+)
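To make the literal forms above concrete, here is a minimal sketch. The variable names and sample values are hypothetical and chosen only for illustration.

```python
# Hypothetical value used only for this illustration.
name = "Ada"

single = 'hello'
double = "hello"
assert single == double            # quote style does not change the value

multi = """line one
line two"""
assert multi == "line one\nline two"   # triple quotes keep the newline

raw = r"C:\new\table"              # backslashes are kept literally
cooked = "C:\\new\\table"          # same text, written with escaped backslashes
assert raw == cooked
assert len(r"\n") == 2 and len("\n") == 1

data = b"binary data"              # a bytes object, not a str
assert isinstance(data, bytes)
assert data[0] == 98               # indexing bytes yields integers (ord("b"))

greeting = f"Hello, {name}!"       # expression evaluated at runtime
assert greeting == "Hello, Ada!"
```

Raw strings are mostly useful for Windows paths and regex patterns, where backslashes are common.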
f-strings (formatted string literals) are the modern, preferred way to format strings. They are usually faster than .format() and % formatting, more readable, and can embed arbitrary expressions.
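As a rough comparison of the three formatting styles mentioned above, the sketch below builds the same string each way. The names name and score are hypothetical sample data.

```python
# Hypothetical sample data for the comparison.
name = "Alice"
score = 91.256

percent_style = "%s scored %.1f" % (name, score)         # printf-style
format_style = "{} scored {:.1f}".format(name, score)    # str.format()
fstring_style = f"{name} scored {score:.1f}"             # f-string

# All three produce the same text.
assert percent_style == format_style == fstring_style == "Alice scored 91.3"

# f-strings can embed arbitrary expressions, including method calls.
shout = f"{name.upper()} has {len(name)} letters"
assert shout == "ALICE has 5 letters"
```

The f-string keeps each value next to the place it is used, which is the main readability gain.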
Common string patterns:
- Validation: .isdigit(), .isalpha(), .isalnum(), .isspace()
- Transformation: .upper(), .lower(), .title(), .strip(), .replace()
- Searching: .find(), .index(), .startswith(), .endswith(), in
- Splitting/Joining: .split(), .join(), .partition()
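The patterns listed above often combine in small parsing helpers. The following is a minimal sketch; parse_key_value is a hypothetical helper, not part of the standard library.

```python
def parse_key_value(line):
    """Hypothetical helper: split 'key = value' into a cleaned (key, value) pair."""
    key, sep, value = line.partition("=")   # splits at the first '=' only
    if not sep:
        return None                         # no '=' in the line
    return key.strip().lower(), value.strip()


assert parse_key_value("  Port = 8080 ") == ("port", "8080")
assert parse_key_value("no separator") is None

# Validation methods return booleans and never raise.
assert "8080".isdigit()
assert not "80.80".isdigit()    # '.' is not a digit
assert not "-5".isdigit()       # neither is a sign
assert not "".isdigit()         # the empty string is never valid
assert "abc123".isalnum()
assert " \t\n".isspace()

# Searching: 'in' for membership, find() for a position.
path = "report_2024.csv"
assert path.endswith(".csv")
assert "2024" in path
assert path.find("2024") == 7
assert path.find("x") == -1     # not found
```

Note that isdigit() rejects signs and decimal points, so it is not a general "is this a number" check.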
Regular expressions (re module) provide powerful pattern matching for complex text processing tasks.
Code Example
# ============================================================
# String methods
# ============================================================
s = " Hello, World! "

# Whitespace handling
print(s.strip())   # "Hello, World!" - remove leading/trailing whitespace
print(s.lstrip())  # "Hello, World! " - left strip only
print(s.rstrip())  # " Hello, World!" - right strip only

# Case transformation
print("hello".upper())             # "HELLO"
print("HELLO".lower())             # "hello"
print("hello world".title())       # "Hello World"
print("hello world".capitalize())  # "Hello world"
print("Hello".swapcase())          # "hELLO"

# Searching
text = "Python is awesome and Python is fun"
print(text.find("Python"))     # 0 (first occurrence index)
print(text.find("Python", 1))  # 22 (search starts at index 1)
print(text.find("Java"))       # -1 (not found)
print(text.count("Python"))    # 2

print(text.startswith("Python"))  # True
print(text.endswith("fun"))       # True

# Replacing
print(text.replace("Python", "JavaScript"))
# "JavaScript is awesome and JavaScript is fun"

print(text.replace("Python", "JS", 1))  # Replace only first occurrence
# "JS is awesome and Python is fun"

# Splitting and joining
csv_line = "Alice,30,NYC,Engineer"
parts = csv_line.split(",")  # ['Alice', '30', 'NYC', 'Engineer']
print(parts)

words = " hello world foo ".split()  # Split on any whitespace
print(words)  # ['hello', 'world', 'foo']

# Join - the opposite of split
print(", ".join(["Alice", "Bob", "Charlie"]))  # "Alice, Bob, Charlie"
print("\n".join(["line1", "line2", "line3"]))  # Multi-line string

# partition - split into 3 parts at first occurrence
email = "user@example.com"
user, at, domain = email.partition("@")
print(f"User: {user}, Domain: {domain}")

# ============================================================
# f-strings (Python 3.6+) - the modern way
# ============================================================
name = "Alice"
age = 30
balance = 1234567.891

# Basic interpolation
print(f"Name: {name}, Age: {age}")

# Expressions
print(f"Next year: {age + 1}")
print(f"Name length: {len(name)}")
print(f"Uppercase: {name.upper()}")

# Format specifiers
print(f"Currency: ${balance:,.2f}")      # $1,234,567.89
print(f"Percentage: {0.856:.1%}")        # 85.6%
print(f"Padded: {42:08d}")               # 00000042
print(f"Binary: {255:08b}")              # 11111111
print(f"Hex: {255:#06x}")                # 0x00ff
print(f"Scientific: {12345.6789:.2e}")   # 1.23e+04

# Alignment
print(f"{'left':<20}|")    # "left                |"
print(f"{'right':>20}|")   # "               right|"
print(f"{'center':^20}|")  # "       center       |"
print(f"{'padded':*^20}")  # "*******padded*******"

# Debugging with = (Python 3.8+)
x = 42
print(f"{x = }")       # "x = 42"
print(f"{x * 2 = }")   # "x * 2 = 84"
print(f"{name = !r}")  # "name = 'Alice'" (with repr)

# Multi-line f-strings
report = (
    f"User Report\n"
    f"{'='*40}\n"
    f"Name: {name}\n"
    f"Age: {age}\n"
    f"Balance: ${balance:,.2f}"
)
print(report)

# ============================================================
# Regular expressions basics
# ============================================================
import re

text = "Contact: alice@email.com or bob@company.org"

# Find all email addresses
emails = re.findall(r'[\w.]+@[\w.]+\.\w+', text)
print(emails)  # ['alice@email.com', 'bob@company.org']

# Search (first match)
match = re.search(r'(\w+)@(\w+)\.\w+', text)
if match:
    print(f"Full match: {match.group()}")  # alice@email.com
    print(f"Username: {match.group(1)}")   # alice
    print(f"Domain: {match.group(2)}")     # email

# Substitution
cleaned = re.sub(r'\d{3}-\d{4}', 'XXX-XXXX', "Call 555-1234 or 555-5678")
print(cleaned)  # "Call XXX-XXXX or XXX-XXXX"

# Compiled patterns (faster for repeated use)
email_pattern = re.compile(r'[\w.]+@[\w.]+\.\w+')
all_emails = email_pattern.findall(text)

# Common patterns
phone = re.compile(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b')
url = re.compile(r'https?://[\w./\-?=&]+')
ip = re.compile(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')

# ============================================================
# String encoding
# ============================================================
# str -> bytes (encode)
text = "Hello, 世界! 🌍"
encoded = text.encode('utf-8')
print(encoded)  # b'Hello, \xe4\xb8\x96\xe7\x95\x8c! \xf0\x9f\x8c\x8d'

# bytes -> str (decode)
decoded = encoded.decode('utf-8')
print(decoded)  # "Hello, 世界! 🌍"

# Length difference: str counts characters, bytes counts bytes
print(len(text))     # 12 characters
print(len(encoded))  # 19 bytes (CJK chars = 3 bytes each, emoji = 4 bytes)
Practice Exercise
Exercises:
- Write a function is_palindrome(s) that checks if a string is a palindrome, ignoring case, spaces, and punctuation. Example: "A man, a plan, a canal: Panama" -> True.
- Implement a Caesar cipher: write encrypt(text, shift) and decrypt(text, shift) functions. Handle uppercase, lowercase, and non-alpha characters.
- Parse a log file line like "2024-01-15 14:30:22 [ERROR] Database connection timeout" and extract the date, time, level, and message using both .split() and regex.
- Create a format_table(headers, rows) function that displays data in a formatted ASCII table with column alignment. Use f-string formatting.
- Write a function that converts between cases: snake_case, camelCase, PascalCase, and kebab-case. Handle edge cases.
- Use regex to validate: email addresses, phone numbers (various formats), and URLs. Write comprehensive test cases for each.
Common Mistakes
- Using + for string concatenation in loops. Each concatenation can create a new string object, so building a long string this way can take O(n^2) time. Collect the pieces in a list and call ''.join(pieces) instead.
- Forgetting that strings are immutable. s.upper() returns a NEW string; it doesn't modify s. You must assign the result: s = s.upper().
- Using .format() or % formatting when f-strings are available (Python 3.6+). f-strings are usually faster and more readable.
- Not using raw strings for regex patterns. re.search('\d+', text) should be re.search(r'\d+', text). Without r, Python processes backslash escapes before the regex engine sees the pattern: '\b' becomes a backspace character instead of a word boundary, and unrecognized escapes such as '\d' trigger a warning in recent Python versions.
- Confusing str.find() (returns -1 if not found) with str.index() (raises ValueError if not found). Use find() when absence is expected, index() when absence is an error.
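The mistakes above are easiest to see side by side. This is a small sketch with hypothetical sample values; it is meant as an illustration, not a benchmark.

```python
import re

# 1. Build strings with join() rather than repeated += in a loop.
pieces = [str(n) for n in range(5)]
joined = ",".join(pieces)
assert joined == "0,1,2,3,4"

# 2. Strings are immutable: methods return new strings.
s = "hello"
s.upper()                 # result is discarded, s is unchanged
assert s == "hello"
s = s.upper()             # rebind the name to keep the result
assert s == "HELLO"

# 3. Raw strings keep backslashes intact for the regex engine.
assert len("\b") == 1     # a single backspace character
assert len(r"\b") == 2    # backslash followed by 'b'
assert re.findall(r"\bcat\b", "cat concat cat") == ["cat", "cat"]
assert re.findall("\bcat\b", "cat concat cat") == []   # looks for backspaces

# 4. find() returns -1 when absent; index() raises ValueError.
text = "key=value"
assert text.find(":") == -1
try:
    text.index(":")
except ValueError:
    missing = True
else:
    missing = False
assert missing

# Using an unchecked find() result as a slice bound fails silently:
assert text[:text.find(":")] == "key=valu"   # -1 drops the last character
```

The last line shows why index() is the safer choice when the substring must be present: the error surfaces immediately instead of producing a subtly wrong slice.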