The Fundamental Problem: The Full Table Scan
Imagine a table with 10 million rows. To find a specific ID without an index, the database must perform a Full Table Scan—checking every single row from 1 to 10,000,000. This is $O(n)$ complexity.
Indexing transforms this into a “search in a book” experience using a data structure called a B-Tree.
1. What is an Index?
At its core, an index is a separate, sorted data structure that maps a specific value (like an ID or Name) to a Pointer (the physical address of the row on disk).
The “Pointer” Concept
- The Index Record: Lives in the index tree. It says: “Value 72 is at GPS Coordinate X.”
- The Pointer: A physical address (File ID, Page ID, Slot No.) that tells the database exactly where to go on the hard drive to grab the full row.
2. The Anatomy of a B-Tree
A B-Tree (Balanced Tree) is a hierarchy of Pages. Instead of searching row-by-row, the database “jumps” through levels:
- Root Node (The Starting Point): Contains broad ranges (e.g., “Values 1-100 go to Page A, 101-200 go to Page B”).
- Internal Nodes (The Navigators): Narrow the search further (e.g., “Values 51-75 go to Page C”).
- Leaf Nodes (The Destination): These contain the actual sorted values and their corresponding Pointers to the disk.
Note: Leaf nodes are linked horizontally. This allows for Range Scans (e.g.,
WHERE age BETWEEN 20 AND 30) by finding the first value and simply walking across the neighbors.
3. The Memory Level: The 8KB “Page”
Database performance is dictated by how data is stored in memory. Databases move data in chunks called Pages.
- Standard Size: Usually 8 KB or 16 KB (to match the Operating System’s block size).
- The Fan-out Math: * If an Index Entry (Value + Pointer) is 10 bytes:
- One 8 KB Page can hold $\approx 800$ entries.
- Level 1 (Root): 1 Page (800 ranges)
- Level 2 (Internal): 800 Pages ($800 \times 800 = 640,000$ ranges)
- Level 3 (Leaf): 640,000 Pages ($\approx 512,000,000$ pointers)
Result: The database can find 1 specific row out of 500 million by reading only 3 pages (24 KB of data).
4. Numeric vs. String Indexing
Can you index strings (VARCHAR)? Yes. The B-Tree treats them lexicographically (dictionary order).
| Feature | Numeric (INT) | String (VARCHAR) |
|---|---|---|
| Size | Small (4-8 bytes) | Large (can be 100s of bytes) |
| Density | High (More entries per page) | Low (Fewer entries per page) |
| Speed | Faster (Shallow tree) | Slower (Taller tree) |
| Wildcards | N/A | Works for LIKE 'abc%', Fails for LIKE '%abc' |
5. Maintenance: The Cost of Speed
Indexes make SELECT fast, but they make INSERT/UPDATE slower. When you add a row, the database must:
- Write the data to the table.
- Update the B-Tree.
- If a Page is full, perform a Page Split (creating a new page and re-balancing the tree), which is a heavy I/O operation.
Summary Table: Index Types
| Type | Logic | Use Case |
|---|---|---|
| Clustered Index | The leaf node is the actual data row. | Primary Keys. |
| Non-Clustered | The leaf node is a pointer to the data. | Secondary lookups (Email, Name). |