6. The “Handshake”: Why OS and Database Page Sizes Differ
When you run `getconf PAGESIZE` on Ubuntu, you see 4096 (4 KB). Yet your database often uses 8 KB or 16 KB pages. If all of this data ends up in the same RAM anyway, why the mismatch?
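You can read the same value from Python without shelling out to `getconf` (a minimal sketch; `mmap.PAGESIZE` simply reports the page size the OS was built with):

```python
import mmap

# The OS page size: the same number `getconf PAGESIZE` prints.
# (Typically 4096 on x86-64 Linux; some ARM systems use 16 KB pages.)
print(mmap.PAGESIZE)
```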
The Generalist (Ubuntu) vs. The Specialist (Database)
- Ubuntu (4 KB): An Operating System is a generalist. It manages everything from tiny config files to massive videos. If the OS used 16 KB pages, a 1 KB text file would still take up 16 KB of RAM—wasting 94% of that space. 4 KB is the “Goldilocks” size for general computing.
- Database (8 KB+): A database is a specialist. It knows its files are massive. By using a larger Logical Page, it increases Fan-out (the number of pointers per page). This keeps the B-Tree “short and fat” rather than “tall and skinny,” reducing the number of disk reads.
How They Sync
There is no “clash” because these sizes are multiples. When the Database asks for one 8 KB page, Ubuntu simply fetches two 4 KB hardware blocks.
| Layer | Unit | Size | Role |
|---|---|---|---|
| CPU Cache | Cache Line | 64 Bytes | The ultra-fast “bite-sized” chunks the CPU actually chews on. |
| Ubuntu (OS) | Page | 4 KB | The standard unit for moving data from Disk to RAM. |
| Database | Page/Block | 8 KB | The logical unit for organizing the B-Tree index. |
7. The Math of Pointer Density (6-byte vs. 8-byte)
You might wonder: If a 64-bit system uses 8-byte addresses, why do databases try to use 6-byte pointers?
It comes down to Page Density. A smaller pointer allows more entries to fit into a single 8 KB page. Let’s look at the math for a table indexing a 4-byte Integer ID:
Scenario A: Standard 64-bit Pointer (8 bytes)
- Entry Size: 4B (ID) + 8B (Pointer) = 12 bytes
- Entries per Page: $8192 / 12 \approx \mathbf{682}$
- Tree Capacity (3 Levels): $682^3 \approx \mathbf{317}$ Million rows
Scenario B: Optimized Disk Pointer (6 bytes)
- Entry Size: 4B (ID) + 6B (Pointer) = 10 bytes
- Entries per Page: $8192 / 10 \approx \mathbf{819}$
- Tree Capacity (3 Levels): $819^3 \approx \mathbf{550}$ Million rows
The Deep Insight: By “shaving off” 2 bytes from the pointer, we increased the capacity of a 3-level tree by over 230 million rows without adding a single extra “jump” (Disk I/O). In the database world, smaller is faster.
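The arithmetic behind both scenarios can be reproduced in a few lines (a sketch of the calculation only, not of any real engine's page layout, which also reserves space for headers and metadata):

```python
# Reproduce the fan-out math: how many (ID, pointer) entries fit in an
# 8 KB page, and how many rows a 3-level B-Tree can therefore index.
PAGE_SIZE = 8192

def tree_capacity(pointer_bytes: int, id_bytes: int = 4, levels: int = 3):
    """Return (entries per page, row capacity of a `levels`-deep tree)."""
    entries = PAGE_SIZE // (id_bytes + pointer_bytes)
    return entries, entries ** levels

print(tree_capacity(8))  # (682, 317214568) -> ~317 million rows
print(tree_capacity(6))  # (819, 549353259) -> ~550 million rows
```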
8. From Disk to RAM: The Buffer Pool
The database doesn’t read your index byte-by-byte.
- On Disk: The index is stored in 8 KB chunks.
- The Request: You search for `ID 72`.
- The Fetch: The DB finds the 8 KB page containing `72` and copies it exactly into a dedicated section of RAM called the Buffer Pool.
- The CPU Speed: Once that 8 KB page is in RAM, the CPU can scan those 819 entries in nanoseconds.
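The idea can be sketched as a tiny cache (a plain dictionary standing in for a real buffer pool, which would also handle eviction, pinning, and dirty pages):

```python
# A toy buffer pool: pages are fetched from "disk" (a bytes blob here)
# in fixed 8 KB units and cached in RAM, so repeat reads skip the disk.
PAGE_SIZE = 8192

class BufferPool:
    def __init__(self, disk: bytes):
        self.disk = disk
        self.cache = {}        # page number -> 8 KB page held in RAM
        self.disk_reads = 0

    def get_page(self, page_no: int) -> bytes:
        if page_no not in self.cache:   # miss: copy the page from disk
            start = page_no * PAGE_SIZE
            self.cache[page_no] = self.disk[start:start + PAGE_SIZE]
            self.disk_reads += 1
        return self.cache[page_no]      # hit: served straight from RAM

disk = bytes(PAGE_SIZE * 4)             # a pretend 4-page data file
pool = BufferPool(disk)
pool.get_page(2)
pool.get_page(2)
print(pool.disk_reads)                  # 1 -- the second read was a cache hit
```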
9. The Math of the 6-Byte Pointer: Why 256 Terabytes?
You noticed a critical detail: $2^{48}$ is a massive number, but why does it equal 256 Terabytes and not Terabits?
The Calculation
In computer architecture, addresses point to Bytes, not Bits.
- 1 Byte = 8 Bits.
- 6 Bytes = 48 Bits.
- Total Addressable Units: $2^{48}$ bytes.
Let’s break the math down: $$2^{10} = 1,024 \text{ (1 Kilobyte)}$$ $$2^{20} = 1,048,576 \text{ (1 Megabyte)}$$ $$2^{30} = 1,073,741,824 \text{ (1 Gigabyte)}$$ $$2^{40} = 1,099,511,627,776 \text{ (1 Terabyte)}$$
So, $2^{48}$ is actually $2^8 \times 2^{40}$: $$256 \times 1 \text{ Terabyte} = \mathbf{256 \text{ Terabytes}}$$
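The conversion is easy to verify directly:

```python
# 2^48 bytes = 2^8 x 2^40 bytes = 256 terabytes (binary units).
TB = 2 ** 40            # bytes in one terabyte
capacity = 2 ** 48      # addressable bytes with a 48-bit pointer
print(capacity // TB)   # 256
```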
Why Disk Pointers are 6 Bytes vs. RAM Pointers are 8 Bytes
This is a classic “Software vs. Hardware” distinction.
| Feature | RAM Pointer (Hardware) | Disk Pointer (Database File) |
|---|---|---|
| Size | 8 Bytes (64-bit) | 6 Bytes (48-bit) |
| Reasoning | Hardware Mandate: Your 64-bit CPU is “wired” to read 8-byte chunks. It is a physical requirement of the memory bus. | Storage Optimization: Database engineers realized a single table rarely exceeds 256 TB. Saving 2 bytes per pointer is a massive win. |
| The “Translation” | The CPU uses the address directly to talk to the RAM sticks. | When the DB loads a page into RAM, it “pads” the 6-byte pointer with two zero bytes to make it 8 bytes so the CPU can read it. |
The Efficiency Gain
By using 6 bytes on disk instead of 8, we aren’t just saving disk space; we are increasing Page Density. As we calculated earlier, this “shaving” of 2 bytes allows us to fit ~20% more entries into every 8 KB page.
More entries per page = A wider tree = Fewer disk seeks = Faster Queries.
10. Summary: The Anatomy of a Search
To wrap up the journey from the query to the hardware:
- The Query: You ask for `WHERE ID = 72`.
- The Root/Internal Nodes: The DB navigates through 8 KB pages in RAM (or fetches them from disk).
- The Leaf Node: The DB finds the entry for `72` and reads the 6-byte disk pointer.
- The Padding: The DB software converts those 6 bytes into an 8-byte RAM address.
- The Fetch: The CPU uses that 8-byte address to grab the actual row data from the Buffer Pool or Disk.
11. The “Postman” Analogy: Why we address Bytes, not Bits
One of the most common points of confusion in low-level systems is why a 48-bit pointer address equals 256 Terabytes and not Terabits. It all comes down to the Addressable Unit.
The Street vs. House Problem
Imagine you are designing a postal system for a massive city:
- The Bits (The Houses): There are billions of houses. If you give every single house a unique, 50-digit GPS coordinate, your “Address Book” (the Index) would eventually become larger than the city itself!
- The Byte (The Street): To be efficient, you group 8 houses together into one Street. Now, the postman only needs one address for the entire street.
- The Rule: In computer architecture, the smallest “Street” we address is 1 Byte (8 bits).
Why not address every Bit?
If we tried to give every single bit its own unique address:
- Index Bloat: The index would take up 8x more space.
- Hardware Complexity: The CPU and RAM would have to be 8x more complex to manage those tiny, individual destinations.
Instead, the CPU asks for the address of a Byte, pulls all 8 bits into its “hands” (the registers), changes the bit it needs, and puts the whole Byte back.
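That read-modify-write dance can be sketched at the byte level (illustrative only; a real CPU does this in registers, not Python):

```python
# To set one bit, load the whole byte it lives in, change the bit,
# and store the whole byte back -- bytes are the addressable unit.
def set_bit(buf: bytearray, bit_index: int) -> None:
    byte_index, offset = divmod(bit_index, 8)  # which "street", which "house"
    buf[byte_index] |= 1 << offset             # modify inside the byte

mem = bytearray(2)      # 2 bytes = 16 addressable bits, all zero
set_bit(mem, 10)        # bit 10 lives in byte 1, at offset 2
print(mem)              # bytearray(b'\x00\x04')
```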
The Final Math
Because every “Street” (Byte) has one unique pointer address:
- Address size: 6 Bytes (48 bits).
- Number of “Streets” we can find: $2^{48}$ unique addresses.
- Total Capacity: Since each address holds 1 Byte, we can address $2^{48}$ Bytes.
- Conversion: $2^{48} \text{ bytes} = \mathbf{256 \text{ Terabytes}}$.
Key Takeaway: If a pointer points to 8 bits (1 Byte) as its smallest unit, it is called Byte-Addressable. This is the standard for almost all modern CPUs, RAM, and Databases.
12. Summary Table: Bits, Bytes, and Pointers
| Unit | What it is | Can it be addressed? |
|---|---|---|
| Bit | A single 0 or 1 (A “House”) | No. Too small; would make the index too large. |
| Byte | A group of 8 bits (A “Street”) | Yes. This is the standard unit for all pointers. |
| 6-Byte Pointer | A 48-bit “Address” | Yes. It can find $2^{48}$ different “Streets” (Bytes). |
13. The Great Trade-off: 6-Byte vs. 8-Byte Pointers
If your CPU and RAM already use 8-byte (64-bit) pointers, why does the database bother shrinking them to 6 bytes on the disk? Let’s look at the mathematical “gap” between these two choices.
The Astronomical Difference in Scale
When we increase a pointer by just 2 bytes (16 bits), the capacity doesn’t just grow—it explodes.
| Pointer Size | Total Bits | Math ($2^n$ Bytes) | Max Addressable Capacity |
|---|---|---|---|
| 6 Bytes | 48 bits | $2^{48}$ | 256 Terabytes |
| 8 Bytes | 64 bits | $2^{64}$ | 16 Exabytes |
How big is 16 Exabytes? To put that in perspective:
- 1 Petabyte = 1,024 Terabytes.
- 1 Exabyte = 1,024 Petabytes.
- 16 Exabytes is exactly $2^{16} = 65{,}536$ times larger than 256 Terabytes.
An 8-byte pointer can address more data than almost all the world’s data centers combined today. It is effectively “infinite.”
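The scale gap checks out with two divisions:

```python
# The 64-bit address space versus the 48-bit one (binary units).
EB = 2 ** 60                  # bytes in one exabyte
print(2 ** 64 // EB)          # 16    -> an 8-byte pointer addresses 16 EB
print(2 ** 64 // 2 ** 48)     # 65536 -> 2^16 times the 48-bit space
```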
Why Databases Choose the “Small” 6-Byte Pointer
If 8 bytes is “future-proof” and “infinite,” why choose 6 bytes? It goes back to our 8 KB Page Density.
Let’s assume we are indexing a standard 4-byte Integer ID:
The 8-byte Strategy (Standard):
- Entry = 4B (ID) + 8B (Pointer) = 12 bytes
- Page Capacity = ~682 entries
The 6-byte Strategy (Optimized):
- Entry = 4B (ID) + 6B (Pointer) = 10 bytes
- Page Capacity = ~819 entries
The Decision: Density Over Infinity
Database engineers made a calculated bet:
- The Bet: “Nobody is going to have a single table larger than 256 Terabytes anytime soon.”
- The Reward: By shaving off those 2 bytes, they fit 20% more entries into every single 8 KB page.
Why does this matter? Fitting 20% more entries means the B-Tree is wider. A wider tree is a shorter tree. A shorter tree means the database needs fewer “jumps” to the disk to find your data. In the world of high-performance databases, 20% more density is worth more than 16 Exabytes of unused space.
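The "wider is shorter" claim can be checked directly (a sketch using the idealized fan-outs from the calculation above, ignoring real-world page headers and partially filled nodes):

```python
import math

# How many B-Tree levels are needed to index `rows` entries when
# every page holds `fanout` pointers? Height grows logarithmically.
def levels_needed(rows: int, fanout: int) -> int:
    return math.ceil(math.log(rows) / math.log(fanout))

print(levels_needed(500_000_000, 819))  # 3 -- 6-byte pointers suffice
print(levels_needed(500_000_000, 682))  # 4 -- 8-byte pointers cost a level
```

For a 500-million-row table, the 20% density gain is the difference between three disk jumps and four: a whole level of the tree.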
14. Comparison Summary
| Feature | 6-Byte Disk Pointer | 8-Byte RAM Pointer |
|---|---|---|
| Capacity | 256 TB (Practical) | 16 EB (Infinite) |
| Index Speed | Faster (Higher page density) | Slower (Lower page density) |
| Disk Space | Saved (2 bytes per row) | Wasted (Empty padding) |
| Where it lives | On your Hard Drive / SSD | Inside your CPU / RAM |
Conclusion: The database uses “compressed” 6-byte pointers on the disk to maximize speed, and only “inflates” them to 8-byte pointers when the data enters the CPU to satisfy the hardware’s 64-bit requirement.
15. The “Zero-Padding” Trick: Why we don’t “Under-use” RAM
If the hardware is built for 64-bit (8-byte) “gulping,” aren’t we wasting the CPU’s potential by only giving it 48 bits from the disk? This is where the distinction between Storage Efficiency and Execution Speed becomes vital.
The Shipping Container vs. The Workbench
Think of your Disk as a shipping container and your RAM as a workbench:
- On Disk (The Container): Space is expensive. If we store 64 bits but only use 48, we are shipping “empty air.” By packing only 48 bits (6 bytes), we fit 20% more pointers in the same 8 KB container.
- In RAM (The Workbench): Once the data is on the workbench, the CPU must use its 64-bit “tools” to work.
How the CPU handles 48-bit Pointers
When the database pulls a 6-byte pointer into RAM, it performs a Zero-Padding operation. It places your 48 bits into a 64-bit “register” (a tiny pocket inside the CPU) and fills the remaining 16 bits with zeros:
```
[00000000 00000000] [48 bits of actual address]
```
Because a 64-bit CPU is designed to process 64 bits in a single clock cycle, it takes exactly the same amount of time to process this padded number as it would to process a full 64-bit number. We aren’t slowing down the CPU; we are just making the Disk work less.
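The truncate-on-write / pad-on-read round trip can be demonstrated with plain Python integers (illustrative; a real engine does this in C as the page is loaded):

```python
# Store a 48-bit address in 6 bytes on "disk", then zero-pad it back to
# the 8 bytes a 64-bit CPU expects. Nothing is lost as long as the
# address fits in 48 bits (i.e. points below 256 TB).
address = 0x1A2B_3C4D_5E6F               # a 48-bit file offset

on_disk = address.to_bytes(6, "little")  # compact 6-byte disk form
in_ram = on_disk + b"\x00\x00"           # zero-padded 8-byte register form

print(len(on_disk), len(in_ram))                    # 6 8
print(int.from_bytes(in_ram, "little") == address)  # True
```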