Database Design & Indexing


What is Database Design?

Database Design = Process of organizing data to reduce redundancy and improve data integrity

Key Goals:

  • Data Integrity - Accurate and consistent data

  • Eliminate Redundancy - Store data only once

  • Optimize Performance - Fast queries and efficient storage

  • Maintainability - Easy to update and extend

  • Scalability - Handle growing data volumes

Design Process Overview

1. Requirements Analysis    → What data do we need?
2. Conceptual Design        → ER Diagram (entities and relationships)
3. Logical Design           → Tables, columns, relationships
4. Normalization            → Reduce redundancy (1NF to 3NF+)
5. Physical Design          → Indexes, partitioning, storage
6. Implementation           → Create tables, constraints, indexes
7. Testing & Optimization   → Query tuning, index optimization

Database Normalization

What is Normalization?

Normalization = Process of organizing data to minimize redundancy and dependency

Benefits:

  • ✅ Eliminates data redundancy

  • ✅ Prevents update anomalies

  • ✅ Improves data integrity

  • ✅ Reduces storage space

Drawbacks:

  • ❌ More tables = more JOINs

  • ❌ Can reduce read performance

  • ❌ More complex queries

Before Normalization Example


First Normal Form (1NF)

Rule: Eliminate repeating groups and ensure atomic values

Requirements:

  1. Each column contains atomic (indivisible) values

  2. Each column contains values of a single type

  3. Each column has a unique name

  4. Order doesn't matter

Example: Converting to 1NF


Second Normal Form (2NF)

Rule: Must be in 1NF AND eliminate partial dependencies

Partial Dependency: Non-key column depends on part of a composite primary key

Requirements:

  1. Must be in 1NF

  2. No partial dependencies (all non-key columns depend on the entire primary key)

Example: Converting to 2NF


Third Normal Form (3NF)

Rule: Must be in 2NF AND eliminate transitive dependencies

Transitive Dependency: Non-key column depends on another non-key column

Requirements:

  1. Must be in 2NF

  2. No transitive dependencies (non-key columns depend only on the primary key)

Example: Converting to 3NF

Real-World 3NF Example


Boyce-Codd Normal Form (BCNF)

Rule: Stronger version of 3NF - every determinant must be a candidate key

Requirements:

  1. Must be in 3NF

  2. For every functional dependency (X → Y), X must be a superkey

Example: Converting to BCNF


Fourth Normal Form (4NF)

Rule: Must be in BCNF AND eliminate multi-valued dependencies

Multi-Valued Dependency: One attribute determines multiple independent sets of values

Example: Converting to 4NF


Fifth Normal Form (5NF)

Rule: Must be in 4NF AND eliminate join dependencies

Join Dependency: Data can be reconstructed by joining multiple tables

Example: 5NF


Normalization Summary

Normal Form
Rule
Example Violation

1NF

Atomic values, no repeating groups

Phones: "555-1234, 555-5678"

2NF

1NF + No partial dependencies

ProductName depends only on ProductId in composite key

3NF

2NF + No transitive dependencies

DepartmentName depends on DepartmentId, not EmployeeId

BCNF

3NF + Every determinant is a candidate key

InstructorId → InstructorName, but InstructorId not a key

4NF

BCNF + No multi-valued dependencies

Skills and Certifications independent

5NF

4NF + No join dependencies

Rare in practice

When to Denormalize?

Denormalization = Intentionally adding redundancy for performance

When to consider:

  • ✅ Read-heavy applications (reporting, analytics)

  • ✅ Complex joins hurting performance

  • ✅ Calculated/aggregated values accessed frequently

  • ✅ Immutable historical data

Common denormalization patterns:


Entity-Relationship (ER) Diagrams

Relationship Types

One-to-One (1:1)

One-to-Many (1:N)

Many-to-Many (M:N)

Self-Referencing Relationships


Constraints and Data Integrity

Primary Keys

Foreign Keys

Unique Constraints

Check Constraints

Default Constraints


Indexing Fundamentals

What is an Index?

Index = Data structure that improves query performance at the cost of write performance and storage

Analogy: Like a book index - helps find information quickly without reading entire book

Benefits:

  • ✅ Faster SELECT queries

  • ✅ Faster sorting (ORDER BY)

  • ✅ Faster filtering (WHERE)

  • ✅ Faster joins

Costs:

  • ❌ Slower INSERT, UPDATE, DELETE

  • ❌ Additional storage space

  • ❌ Index maintenance overhead


Clustered Index

Clustered Index = Determines the physical order of data in the table

Key Points:

  • 📌 Only ONE clustered index per table

  • 📌 Table data is sorted by clustered index key

  • 📌 Leaf nodes contain actual data rows

  • 📌 Primary key creates clustered index by default

Choosing Clustered Index Column

Good candidates:

  • ✅ Primary key (most common)

  • ✅ Frequently used in range queries (dates, IDs)

  • ✅ Unique or near-unique values

  • ✅ Narrow key (int better than varchar)

  • ✅ Ever-increasing values (IDENTITY, timestamps)

Avoid:

  • ❌ Wide keys (multiple large columns)

  • ❌ Frequently updated columns

  • ❌ Random values (GUIDs without NEWSEQUENTIALID)


Non-Clustered Index

Non-Clustered Index = Separate structure that points to data

Key Points:

  • 📌 Up to 999 non-clustered indexes per table (SQL Server 2016+)

  • 📌 Leaf nodes contain pointer to data (or clustered index key)

  • 📌 Can have multiple per table

  • 📌 Can be created on views

Index Column Order Matters


Covering Index (INCLUDE clause)

Covering Index = Index contains all columns needed by query (no table lookup required)


Filtered Index

Filtered Index = Index on subset of rows (SQL Server 2008+)


Columnstore Index

Columnstore Index = Stores data by column (not row) for analytics


Index Best Practices

1. Index Selectivity

2. Index Maintenance

3. Index Naming Conventions

4. Over-Indexing vs Under-Indexing

5. Finding Missing Indexes

6. Finding Unused Indexes


Query Optimization Fundamentals

Execution Plans

Execution Plan = SQL Server's roadmap for executing a query

Types:

  1. Estimated Plan - What SQL Server plans to do (CTRL+L)

  2. Actual Plan - What SQL Server actually did (CTRL+M)

Reading Execution Plans

Read from right to left, top to bottom

Common operators:

  • Table Scan 🔴 - Reads entire table (slow for large tables)

  • Clustered Index Scan 🟡 - Reads all rows in clustered index

  • Index Seek 🟢 - Efficiently finds specific rows (fast)

  • Index Scan 🟡 - Reads all rows in index

  • Key Lookup 🟡 - Extra table lookup after index seek

  • Nested Loops 🟢 - Good for small datasets

  • Hash Match 🟡 - Good for large datasets

  • Merge Join 🟢 - Best when inputs are sorted

Cost percentages:

  • Higher percentage = more expensive operation

  • Focus optimization on highest cost operators


Query Optimization Techniques

1. SELECT Only Required Columns

2. Use WHERE Instead of HAVING

3. Avoid Functions on Indexed Columns

4. Use EXISTS Instead of IN with Subqueries

5. Use JOIN Instead of Subqueries (Usually)

6. Avoid SELECT DISTINCT (Use GROUP BY if Possible)

7. Use UNION ALL Instead of UNION

8. Avoid Wildcard at Start of LIKE

9. Use Appropriate JOIN Types

10. Optimize OR Conditions


Common Query Anti-Patterns

1. N+1 Query Problem

2. Using Scalar Functions in SELECT

3. Using Cursors Instead of Set-Based Operations

4. Implicit Data Type Conversion

5. Using NOT IN with NULLs


Statistics and Query Optimizer

Database Statistics

Statistics = Metadata about data distribution used by query optimizer

Query Hints


Performance Monitoring

Key Performance Metrics


Best Practices Summary

Database Design

  1. ✅ Normalize to 3NF, denormalize only when necessary

  2. ✅ Use appropriate data types (don't over-allocate)

  3. ✅ Add proper constraints (PK, FK, CHECK, UNIQUE)

  4. ✅ Use IDENTITY for auto-incrementing keys

  5. ✅ Consider soft deletes instead of hard deletes

Indexing

  1. ✅ Create indexes on foreign keys

  2. ✅ Create indexes on frequently filtered columns (WHERE, JOIN)

  3. ✅ Use covering indexes for frequently accessed columns

  4. ✅ Don't over-index (2-5 indexes per table is usually enough)

  5. ✅ Maintain indexes regularly (rebuild/reorganize)

Query Writing

  1. ✅ Select only required columns (avoid SELECT *)

  2. ✅ Use WHERE before HAVING

  3. ✅ Avoid functions on indexed columns

  4. ✅ Use EXISTS over IN for subqueries

  5. ✅ Use UNION ALL over UNION when duplicates are OK

  6. ✅ Use appropriate JOIN types

  7. ✅ Avoid cursors (use set-based operations)

  8. ✅ Test queries with production-like data volumes

Performance

  1. ✅ Analyze execution plans

  2. ✅ Monitor query performance regularly

  3. ✅ Keep statistics updated

  4. ✅ Use connection pooling

  5. ✅ Consider caching for frequently accessed data


Quick Decision Trees

Should I Create an Index?

Clustered vs Non-Clustered?

Query Optimization Checklist


Guide Complete! This comprehensive database design and optimization guide covers normalization, ER diagrams, constraints, indexing strategies, query optimization, execution plans, and performance tuning. Master these concepts to design efficient databases and write high-performance queries! 🚀

Last updated