Comprehensive SQL questions covering window functions, joins, aggregations, and query optimization.
| sale_id | city | sale_date | amount |
|---|---|---|---|
| 1 | Mumbai | 2024-01-10 | 5000 |
| 2 | Delhi | 2024-01-15 | 7000 |
| 3 | Bangalore | 2024-01-20 | 10000 |
| 4 | Chennai | 2024-02-05 | 3000 |
| 5 | Mumbai | 2024-02-08 | 9000 |
SELECT sale_month, city, total_sales
FROM (
SELECT
DATE_FORMAT(sale_date, '%Y-%m') AS sale_month,
city,
SUM(amount) AS total_sales,
ROW_NUMBER() OVER (PARTITION BY DATE_FORMAT(sale_date, '%Y-%m') ORDER BY SUM(amount) DESC) AS rn
FROM sales_table
GROUP BY sale_month, city
) ranked
WHERE rn <= 3;
Explanation: This query uses window functions to rank cities by total sales within each month. The ROW_NUMBER() function assigns ranks, and we filter for the top 3 cities per month using WHERE rn <= 3.
| sale_id | city | sale_date | amount |
|---|---|---|---|
| 1 | Mumbai | 2024-01-10 | 5000 |
| 2 | Delhi | 2024-01-15 | 7000 |
| 3 | Mumbai | 2024-01-20 | 3000 |
| 4 | Delhi | 2024-02-05 | 6000 |
| 5 | Mumbai | 2024-02-08 | 8000 |
SELECT
sale_id,
city,
sale_date,
amount,
SUM(amount) OVER (PARTITION BY city ORDER BY sale_date) AS running_total
FROM sales_data
ORDER BY city, sale_date;
Explanation: The SUM(amount) OVER (PARTITION BY city ORDER BY sale_date) creates a running total for each city. The window function accumulates the sum ordered by sale date within each city partition.
| emp_id | emp_name | salary | department |
|---|---|---|---|
| 1 | Ravi | 70000 | HR |
| 2 | Priya | 90000 | IT |
| 3 | Kunal | 85000 | Finance |
| 4 | Aisha | 60000 | IT |
| 5 | Rahul | 95000 | HR |
SELECT MAX(salary) AS second_highest_salary
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
Explanation: This query finds the maximum salary that is less than the overall maximum salary. The subquery finds the highest salary, and the outer query finds the maximum of all remaining salaries.
| emp_id | emp_name | salary | department |
|---|---|---|---|
| 1 | Neha | 50000 | HR |
| 2 | Ravi | 70000 | IT |
| 3 | Aman | 50000 | HR |
| 4 | Pooja | 90000 | IT |
| 5 | Karan | 70000 | IT |
SELECT *
FROM employee_salary e1
WHERE EXISTS (
SELECT 1
FROM employee_salary e2
WHERE e1.department = e2.department
AND e1.salary = e2.salary
AND e1.emp_id <> e2.emp_id
);
Explanation: This query uses a correlated subquery with EXISTS to find employees who have at least one other employee in the same department with the same salary. The condition e1.emp_id <> e2.emp_id ensures we don't match an employee with themselves.
| user_id | user_name | email |
|---|---|---|
| 1 | Sameer | sameer@gmail.com |
| 2 | Anjali | anjali@gmail.com |
| 3 | Sameer | sameer@gmail.com |
| 4 | Rohan | rohan@gmail.com |
| 5 | Rohan | rohan@gmail.com |
SELECT user_name, email, COUNT(*)
FROM users
GROUP BY user_name, email
HAVING COUNT(*) > 1;
Explanation: This query groups records by user_name and email, then uses HAVING COUNT(*) > 1 to filter only the groups that have more than one occurrence, effectively finding duplicates.
| user_id | user_name | email |
|---|---|---|
| 1 | Sameer | sameer@gmail.com |
| 2 | Anjali | anjali@gmail.com |
| 3 | Sameer | sameer@gmail.com |
| 4 | Rohan | rohan@gmail.com |
| 5 | Rohan | rohan@gmail.com |
-- MySQL cannot reference the target table directly in a DELETE subquery,
-- so the keeper IDs are wrapped in a derived table
DELETE FROM users
WHERE user_id NOT IN (
SELECT min_id FROM (
SELECT MIN(user_id) AS min_id
FROM users
GROUP BY user_name, email
) keep_ids
);
Explanation: This query keeps the record with the minimum user_id for each unique combination of user_name and email, and deletes all other duplicate records. The subquery identifies which IDs to keep, and the outer DELETE removes everything else.
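The keep-MIN(user_id) pattern can be checked end to end with Python's built-in sqlite3 (SQLite, unlike MySQL, permits the self-referencing subquery directly). The table and data below mirror the example above; this is an illustrative sketch, not production code.

```python
import sqlite3

# In-memory copy of the users table from the example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, user_name TEXT, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", [
    (1, "Sameer", "sameer@gmail.com"),
    (2, "Anjali", "anjali@gmail.com"),
    (3, "Sameer", "sameer@gmail.com"),
    (4, "Rohan", "rohan@gmail.com"),
    (5, "Rohan", "rohan@gmail.com"),
])

# Keep the lowest user_id per (user_name, email); delete the rest.
conn.execute("""
    DELETE FROM users
    WHERE user_id NOT IN (
        SELECT MIN(user_id) FROM users GROUP BY user_name, email
    )
""")
remaining = [row[0] for row in conn.execute(
    "SELECT user_id FROM users ORDER BY user_id")]
print(remaining)  # [1, 2, 4]
```

Rows 3 and 5 are removed because rows 1 and 4 carry the minimum user_id for their duplicate groups.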
| sale_id | city | sale_date | amount |
|---|---|---|---|
| 1 | Mumbai | 2024-01-10 | 5000 |
| 2 | Delhi | 2024-02-15 | 7000 |
| 3 | Mumbai | 2024-01-20 | 3000 |
| 4 | Delhi | 2024-03-05 | 6000 |
| 5 | Mumbai | 2024-02-08 | 8000 |
SELECT
city,
SUM(CASE WHEN DATE_FORMAT(sale_date, '%Y-%m') = '2024-01' THEN amount ELSE 0 END) AS Jan_2024,
SUM(CASE WHEN DATE_FORMAT(sale_date, '%Y-%m') = '2024-02' THEN amount ELSE 0 END) AS Feb_2024,
SUM(CASE WHEN DATE_FORMAT(sale_date, '%Y-%m') = '2024-03' THEN amount ELSE 0 END) AS Mar_2024
FROM sales_data
GROUP BY city;
Explanation: This query uses conditional aggregation with CASE statements to pivot the data. Each month becomes a separate column, and the sum of amounts for that month is calculated for each city.
| order_id | customer_id | order_date | amount |
|---|---|---|---|
| 1 | 101 | 2024-01-10 | 1000 |
| 2 | 102 | 2024-02-15 | 2000 |
| 3 | 101 | 2024-03-20 | 1500 |
| 4 | 103 | 2024-04-05 | 2500 |
| 5 | 101 | 2024-05-08 | 3000 |
SELECT customer_id
FROM orders
WHERE order_date >= DATE_SUB(CURDATE(), INTERVAL 6 MONTH)
GROUP BY customer_id
HAVING COUNT(*) >= 3;
Explanation: This query filters orders from the last 6 months using DATE_SUB(CURDATE(), INTERVAL 6 MONTH), groups by customer, and uses HAVING COUNT(*) >= 3 to find customers with at least 3 orders in that period.
Normalization: breaking tables into smaller, related tables to reduce redundancy and improve data integrity. Use for OLTP, frequent updates, data consistency.
Denormalization: combining tables to reduce joins and improve read performance. Use for OLAP, reporting, analytics.
Clustered index: determines the physical order of the data; one per table; fast for range queries.
Non-clustered index: a separate structure; a table can have many; fast for point lookups. Indexes speed up SELECTs but slow down writes.
SELECT MAX(salary) AS second_highest_salary
FROM employee
WHERE salary < (SELECT MAX(salary) FROM employee);
-- Alternatively, use ROW_NUMBER() or DENSE_RANK() for ties.
Explanation: Finds the maximum salary that is less than the overall maximum. Alternative methods include using ROW_NUMBER() or DENSE_RANK() window functions for handling ties.
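As a quick runnable check, the MAX-below-MAX pattern can be exercised with Python's sqlite3 against a toy employee table (a sketch with made-up data, not the source's schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER, salary INTEGER)")
conn.executemany("INSERT INTO employee VALUES (?, ?)",
                 [(1, 70000), (2, 90000), (3, 85000), (4, 60000), (5, 95000)])

# Highest salary strictly below the overall maximum (95000).
second = conn.execute("""
    SELECT MAX(salary) FROM employee
    WHERE salary < (SELECT MAX(salary) FROM employee)
""").fetchone()[0]
print(second)  # 90000
```

Note this pattern naturally skips duplicates of the top salary; a DENSE_RANK() approach gives the same behavior while generalizing to the Nth highest.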
Use IS NULL or COALESCE() to manage NULLs. In joins, NULLs can lead to unmatched rows. For example, LEFT JOIN returns NULLs for missing matches. Use COALESCE(column, 'default') to replace NULLs.
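To make the LEFT JOIN / COALESCE interaction concrete, here is a small sqlite3 sketch (the emps/bonuses tables are hypothetical): the unmatched row comes back with a NULL bonus, and COALESCE substitutes a default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emps (emp_id INTEGER, name TEXT);
    CREATE TABLE bonuses (emp_id INTEGER, bonus INTEGER);
    INSERT INTO emps VALUES (1, 'Ravi'), (2, 'Priya');
    INSERT INTO bonuses VALUES (1, 500);  -- Priya has no bonus row
""")

# LEFT JOIN yields NULL bonus for Priya; COALESCE replaces it with 0.
rows = conn.execute("""
    SELECT e.name, COALESCE(b.bonus, 0) AS bonus
    FROM emps e LEFT JOIN bonuses b ON e.emp_id = b.emp_id
    ORDER BY e.emp_id
""").fetchall()
print(rows)  # [('Ravi', 500), ('Priya', 0)]
```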
SELECT pipeline_name, MONTHNAME(run_date) AS month, COUNT(*) AS failures,
MAX(COUNT(*)) OVER () AS max_failures
FROM pipeline_log
WHERE status='FAILED' AND MONTH(run_date)=MONTH(CURDATE())
GROUP BY pipeline_name, MONTHNAME(run_date);
Explanation: Uses MONTHNAME() to get the month name, counts failures per pipeline, and uses a window function to find the maximum failures across all pipelines.
SELECT id,
AVG(CASE WHEN item_name='Apple' THEN value END) AS Apple,
AVG(CASE WHEN item_name='Orange' THEN value END) AS Orange,
AVG(CASE WHEN item_name='Banana' THEN value END) AS Banana
FROM input_table
GROUP BY id;
Explanation: Uses conditional aggregation with CASE statements to pivot data, creating separate columns for each item type.
SELECT region, salesamount
FROM (
SELECT region, salesamount,
ROW_NUMBER() OVER (PARTITION BY region ORDER BY salesamount DESC) AS rn
FROM sales
) t
WHERE rn = 3;
Explanation: Uses ROW_NUMBER() window function partitioned by region and ordered by salesamount descending, then filters for the 3rd row in each partition.
LEFT JOIN: all rows from A, matched or NULL-filled from B.
RIGHT JOIN: all rows from B, matched or NULL-filled from A.
INNER JOIN: only rows whose join keys match in both A and B.
SELECT COUNT(DISTINCT customer_id) AS churned_customers,
(SELECT COUNT(DISTINCT customer_id) FROM customers
WHERE signup_date BETWEEN DATEADD(month, -6, GETDATE()) AND GETDATE()) AS total_customers,
COUNT(DISTINCT customer_id)*1.0 /
(SELECT COUNT(DISTINCT customer_id) FROM customers
WHERE signup_date BETWEEN DATEADD(month, -6, GETDATE()) AND GETDATE()) AS churn_rate
FROM customers
WHERE last_active_date < DATEADD(month, -6, GETDATE());
Explanation: Calculates churn rate by dividing churned customers (inactive for 6+ months) by total customers who signed up in the last 6 months.
SELECT room_type,
COUNT(CASE WHEN status='cancelled' THEN 1 END)*1.0/COUNT(*) AS cancellation_rate
FROM bookings
WHERE stay_nights >= 2
AND booking_date >= DATEADD(month, -6, CURRENT_DATE)
GROUP BY room_type;
Explanation: Filters bookings with 2+ night stays from the last 6 months, then calculates the proportion of cancelled bookings per room type.
SELECT country, device_type,
AVG(CASE WHEN booking_id IS NOT NULL THEN 1 ELSE 0 END) AS avg_conversion_rate
FROM searches
LEFT JOIN bookings ON searches.user_id = bookings.user_id AND searches.session_id = bookings.session_id
GROUP BY country, device_type;
Explanation: Joins search events with confirmed bookings, then calculates the conversion rate as the proportion of searches that resulted in bookings, grouped by country and device.
WITH region_avg AS (
SELECT region, AVG(booking_rate) AS avg_rate
FROM properties
WHERE booking_date >= DATEADD(month, -12, CURRENT_DATE)
GROUP BY region
)
SELECT p.property_id
FROM properties p
JOIN region_avg r ON p.region = r.region
WHERE p.booking_rate < r.avg_rate
AND p.booking_date >= DATEADD(month, -12, CURRENT_DATE);
Explanation: Uses a CTE to calculate regional average booking rates, then identifies properties performing below their region's average over the past year.
WITH hourly AS (
SELECT DATE_TRUNC('hour', booking_time) AS hour, COUNT(*) AS bookings
FROM bookings
GROUP BY hour
),
avg_hourly AS (
SELECT AVG(bookings) AS avg_bookings FROM hourly
)
SELECT h.hour, h.bookings
FROM hourly h, avg_hourly a
WHERE h.bookings > 1.5 * a.avg_bookings;
Explanation: Calculates hourly booking counts, compares each hour to the overall hourly average, and identifies hours exceeding 150% of the average.
Data distribution across shards makes cross-shard aggregation complex. Challenges include data consistency, network latency, partial failures, and the need for distributed queries or data pipelines to aggregate metrics globally.
Normalize all timestamps to a standard (e.g., UTC) during ingestion or in queries. Store original and converted timestamps if local analysis is needed. Use timezone-aware functions to aggregate by global day.
Normalize for OLTP systems to reduce redundancy and ensure integrity. Denormalize in OLAP/data warehouses for faster analytics. Use views or materialized views to combine normalized data for reporting.
Use transactions with isolation levels (e.g., SERIALIZABLE), optimistic concurrency control (row versioning/timestamps), or database locking mechanisms to prevent lost updates and ensure consistency.
Window functions allow calculations (e.g., running totals, ranks) across rows without collapsing them, enabling complex analytics (like moving averages or percentiles) that GROUP BY cannot provide without subqueries.
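The "calculate without collapsing" point can be shown with sqlite3 (SQLite 3.25+ supports window functions; the sales rows are illustrative): every input row survives, each carrying its running total, which a GROUP BY could not produce directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumes SQLite >= 3.25 for window functions
conn.execute("CREATE TABLE sales (city TEXT, sale_date TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("Mumbai", "2024-01-10", 5000),
    ("Mumbai", "2024-01-20", 3000),
    ("Mumbai", "2024-02-08", 8000),
])

# Running total per city: all three rows are preserved, unlike GROUP BY.
rows = conn.execute("""
    SELECT sale_date, amount,
           SUM(amount) OVER (PARTITION BY city ORDER BY sale_date) AS running_total
    FROM sales
    ORDER BY sale_date
""").fetchall()
print([r[2] for r in rows])  # [5000, 8000, 16000]
```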
Compares each employee's salary to the overall average.
SELECT * FROM employees WHERE salary > (SELECT AVG(salary) FROM employees);
Finds employees in John's department, excluding John himself.
SELECT name FROM employees WHERE department_id = (SELECT department_id FROM employees WHERE name = 'John') AND name <> 'John';
Gets the highest salary less than the maximum.
SELECT MAX(salary) FROM employees WHERE salary < (SELECT MAX(salary) FROM employees);
Groups orders by customer and filters for those with >5.
SELECT customer_id FROM orders GROUP BY customer_id HAVING COUNT(*) > 5;
Simple aggregation by customer.
SELECT customer_id, COUNT(*) AS order_count FROM orders GROUP BY customer_id;
Filters by join date within last 6 months.
SELECT * FROM employees WHERE join_date >= DATEADD(MONTH, -6, GETDATE());
Aggregates sales by product.
SELECT product_id, SUM(sales_amount) AS total_sales FROM sales GROUP BY product_id;
Finds products with no matching sales.
SELECT p.product_id FROM products p LEFT JOIN sales s ON p.product_id = s.product_id WHERE s.product_id IS NULL;
Example: 10% raise for 'Excellent' performers. Adjust logic as needed.
UPDATE employees SET salary = salary * 1.10 WHERE performance_rating = 'Excellent';
Keeps the first occurrence, deletes others. Replace col1, col2, ... with relevant columns.
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY col1, col2, ... ORDER BY (SELECT NULL)) AS rn
FROM table_name
)
DELETE FROM cte WHERE rn > 1;
Uses window function for ranking.
SELECT salary
FROM (
SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk FROM employees
) t
WHERE rnk = 2;
Finds customers with no orders in the last year.
SELECT c.customer_id, c.name
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id AND o.order_date >= DATEADD(YEAR, -1, GETDATE())
WHERE o.order_id IS NULL
ORDER BY c.customer_id
LIMIT 10; -- SQL Server (whose dialect DATEADD/GETDATE belong to) has no LIMIT; use SELECT TOP 10 there
Uses LAG() for previous year's revenue and calculates growth.
SELECT category_id, year, revenue,
(revenue - LAG(revenue) OVER (PARTITION BY category_id ORDER BY year)) / NULLIF(LAG(revenue) OVER (PARTITION BY category_id ORDER BY year), 0) AS growth_rate
FROM (
SELECT category_id, EXTRACT(YEAR FROM sale_date) AS year, SUM(revenue) AS revenue
FROM sales
GROUP BY category_id, year
) t;
Counts presence in each table, filters for exactly two.
SELECT id
FROM (
SELECT COALESCE(a.id, b.id, c.id) AS id,
(CASE WHEN a.id IS NOT NULL THEN 1 ELSE 0 END +
CASE WHEN b.id IS NOT NULL THEN 1 ELSE 0 END +
CASE WHEN c.id IS NOT NULL THEN 1 ELSE 0 END) AS cnt
FROM a FULL OUTER JOIN b ON a.id = b.id
FULL OUTER JOIN c ON COALESCE(a.id, b.id) = c.id
) t
WHERE cnt = 2;
Uses window function for median (SQL Server 2012+, PostgreSQL).
SELECT region,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sales_amount) AS median_sales
FROM sales
GROUP BY region;
Calculates the proportion of customers who made repeat purchases after their first order.
WITH cohort AS (
SELECT customer_id, MIN(order_date) AS first_order
FROM orders
GROUP BY customer_id
), retained AS (
SELECT c.customer_id
FROM cohort c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= DATEADD(MONTH, 1, c.first_order)
)
SELECT COUNT(*) * 1.0 / (SELECT COUNT(*) FROM cohort) AS retention_rate FROM retained;
Groups by all columns that define a duplicate.
SELECT col1, col2, COUNT(*) AS duplicate_count FROM table_name GROUP BY col1, col2 HAVING COUNT(*) > 1;
Finds employees with long tenure and no promotion record.
SELECT e.*
FROM employees e
LEFT JOIN promotions p ON e.employee_id = p.employee_id
WHERE DATEDIFF(YEAR, e.join_date, GETDATE()) > 5 AND p.employee_id IS NULL;
Groups login events by date and counts distinct users.
SELECT CAST(login_timestamp AS DATE) AS login_date, COUNT(DISTINCT user_id) AS active_users
FROM user_logins
GROUP BY CAST(login_timestamp AS DATE)
ORDER BY login_date;
Uses DENSE_RANK() to rank transactions within each user and filters for the 2nd rank.
SELECT user_id, transaction_amount
FROM (
SELECT user_id, transaction_amount,
DENSE_RANK() OVER (PARTITION BY user_id ORDER BY transaction_amount DESC) AS rnk
FROM transactions
) ranked_transactions
WHERE rnk = 2;
Generates a series of expected hours and LEFT JOINs with actual data to find missing hours. This requires a way to generate a series of dates/times.
WITH all_hours AS (
SELECT GENERATE_SERIES('2024-01-01 00:00:00'::timestamp, '2024-01-01 23:00:00'::timestamp, '1 hour') AS hour_start
), actual_hours AS (
SELECT DATE_TRUNC('hour', event_timestamp) AS hour_start, COUNT(*) AS event_count
FROM event_logs
WHERE event_timestamp >= '2024-01-01 00:00:00' AND event_timestamp < '2024-01-02 00:00:00'
GROUP BY 1
)
SELECT ah.hour_start AS missing_hour
FROM all_hours ah
LEFT JOIN actual_hours act ON ah.hour_start = act.hour_start
WHERE act.event_count IS NULL
ORDER BY missing_hour;
Groups by user to find each user's earliest purchase date, then calculates the difference from the current date.
SELECT user_id, MIN(purchase_date) AS first_purchase_date,
DATEDIFF(CURRENT_DATE, MIN(purchase_date)) AS days_since_first_purchase
FROM purchases
GROUP BY user_id;
Delta Lake automatically tracks schema changes in its transaction log. You can query the history to see schema evolution or use `mergeSchema` option during writes. Detecting *unintended* changes would involve comparing the current schema to a predefined expected schema.
# To see schema history
spark.sql("DESCRIBE HISTORY delta.`/path/to/delta/table`").show()
# To compare current schema with expected (conceptual)
# current_schema = spark.read.format("delta").load("/path/to/delta/table").schema
# expected_schema = StructType([...]) # Define your expected schema
# if current_schema != expected_schema:
# print("Schema mismatch detected!")
Using an `INNER JOIN` automatically filters out rows where the join key is NULL in either table, ensuring only matched records are returned.
SELECT t.*, p.product_name
FROM transactions t
INNER JOIN products p ON t.product_id = p.product_id;
-- This implicitly handles null foreign keys by only returning matched rows.
-- To find transactions with NO matching product, use a LEFT JOIN and keep the NULL rows:
-- SELECT t.* FROM transactions t LEFT JOIN products p ON t.product_id = p.product_id WHERE p.product_id IS NULL;
Joins users and premium_subscriptions, then filters based on the date difference.
SELECT u.user_id, u.signup_date, ps.upgrade_date
FROM users u
JOIN premium_subscriptions ps ON u.user_id = ps.user_id
WHERE ps.upgrade_date <= DATEADD(DAY, 7, u.signup_date);
Most databases reject `COUNT(DISTINCT ...)` as a window function, so instead flag the first purchase of each product per customer with ROW_NUMBER(), then take a running SUM of the flags.
WITH flagged AS (
SELECT customer_id, purchase_date, product_id,
CASE WHEN ROW_NUMBER() OVER (PARTITION BY customer_id, product_id ORDER BY purchase_date) = 1 THEN 1 ELSE 0 END AS is_first
FROM purchases
)
SELECT
customer_id,
purchase_date,
product_id,
SUM(is_first) OVER (PARTITION BY customer_id ORDER BY purchase_date ROWS UNBOUNDED PRECEDING) AS cumulative_distinct_products
FROM flagged
ORDER BY customer_id, purchase_date;
Uses a CTE to calculate average regional spending, then joins back to customers to filter those above average.
WITH RegionalAvgSpend AS (
SELECT c.region, AVG(t.amount) AS avg_region_spend
FROM customers c
JOIN transactions t ON c.customer_id = t.customer_id
GROUP BY c.region
)
SELECT c.customer_id, c.customer_name, SUM(t.amount) AS total_spend, ras.avg_region_spend
FROM customers c
JOIN transactions t ON c.customer_id = t.customer_id
JOIN RegionalAvgSpend ras ON c.region = ras.region
GROUP BY c.customer_id, c.customer_name, c.region, ras.avg_region_spend
HAVING SUM(t.amount) > ras.avg_region_spend;
Groups by all columns and counts occurrences. Filters for counts greater than 1.
SELECT col1, col2, col3, COUNT(*)
FROM ingestion_table
GROUP BY col1, col2, col3 -- Include all columns that define uniqueness
HAVING COUNT(*) > 1;
Calculates daily revenue, then uses `LAG()` to get the previous day's revenue to compute growth percentage.
WITH DailyRevenue AS (
SELECT CAST(sale_date AS DATE) AS sale_day, SUM(amount) AS daily_revenue
FROM sales
GROUP BY CAST(sale_date AS DATE)
)
SELECT sale_day, daily_revenue,
LAG(daily_revenue, 1, 0) OVER (ORDER BY sale_day) AS previous_day_revenue,
(daily_revenue - LAG(daily_revenue, 1, 0) OVER (ORDER BY sale_day)) * 100.0 / NULLIF(LAG(daily_revenue, 1, 0) OVER (ORDER BY sale_day), 0) AS growth_percent
FROM DailyRevenue
ORDER BY sale_day;
Calculates monthly sales, then uses `LAG()` to compare current month's sales with the previous two months.
WITH MonthlySales AS (
SELECT product_id,
DATE_TRUNC('month', sale_date) AS sales_month,
SUM(amount) AS monthly_sales
FROM sales
GROUP BY product_id, DATE_TRUNC('month', sale_date)
), LaggedSales AS (
SELECT product_id, sales_month, monthly_sales,
LAG(monthly_sales, 1) OVER (PARTITION BY product_id ORDER BY sales_month) AS prev_month_sales_1,
LAG(monthly_sales, 2) OVER (PARTITION BY product_id ORDER BY sales_month) AS prev_month_sales_2
FROM MonthlySales
)
SELECT DISTINCT product_id
FROM LaggedSales
WHERE monthly_sales < prev_month_sales_1
AND prev_month_sales_1 < prev_month_sales_2;
Groups logins by user and week, then filters for users meeting the login count criteria over the specified period.
WITH WeeklyLogins AS (
SELECT user_id,
DATE_TRUNC('week', login_timestamp) AS login_week,
COUNT(*) AS weekly_login_count
FROM user_logins
WHERE login_timestamp >= DATEADD(month, -2, CURRENT_DATE)
GROUP BY user_id, DATE_TRUNC('week', login_timestamp)
)
SELECT user_id
FROM WeeklyLogins
WHERE weekly_login_count >= 3
GROUP BY user_id
HAVING COUNT(login_week) >= 8; -- roughly 8 full weeks in 2 months
Counts logins per user in the current quarter, then ranks them based on this count.
SELECT user_id, login_count,
RANK() OVER (ORDER BY login_count DESC) AS login_rank
FROM (
SELECT user_id, COUNT(*) AS login_count
FROM user_logins
WHERE login_timestamp >= DATE_TRUNC('quarter', CURRENT_DATE)
AND login_timestamp < DATE_TRUNC('quarter', CURRENT_DATE) + INTERVAL '3 month'
GROUP BY user_id
) AS quarterly_logins
ORDER BY login_rank;
Groups purchases by user, product, and date, then filters for groups with more than one purchase.
SELECT user_id, product_id, CAST(purchase_date AS DATE) AS purchase_day, COUNT(*) AS purchase_count
FROM purchases
GROUP BY user_id, product_id, CAST(purchase_date AS DATE)
HAVING COUNT(*) > 1;
This is a conceptual approach for a partitioned table. You'd typically have a `load_timestamp` or similar column to identify when data *arrived* vs. its `event_timestamp`. Late data is identified by `event_timestamp` being in a 'closed' partition (e.g., previous month) but `load_timestamp` is current. Deletion then targets these specific records.
-- Identify late-arriving data for the current month's partition
SELECT *
FROM your_partitioned_table
WHERE CAST(event_timestamp AS DATE) < DATE_TRUNC('month', CURRENT_DATE) -- Data belongs to previous month
AND CAST(load_timestamp AS DATE) >= DATE_TRUNC('month', CURRENT_DATE); -- But loaded this month
-- To delete (use with extreme caution after careful testing):
-- DELETE FROM your_partitioned_table
-- WHERE CAST(event_timestamp AS DATE) < DATE_TRUNC('month', CURRENT_DATE)
-- AND CAST(load_timestamp AS DATE) >= DATE_TRUNC('month', CURRENT_DATE);
Calculates profit margin for each product, then ranks and selects the top 5.
SELECT product_id, product_name, profit_margin
FROM (
SELECT p.product_id, p.product_name,
(SUM(s.sales_amount) - SUM(s.cost_amount)) * 1.0 / NULLIF(SUM(s.sales_amount), 0) AS profit_margin,
ROW_NUMBER() OVER (ORDER BY (SUM(s.sales_amount) - SUM(s.cost_amount)) * 1.0 / NULLIF(SUM(s.sales_amount), 0) DESC) AS rnk
FROM products p
JOIN sales s ON p.product_id = s.product_id
GROUP BY p.product_id, p.product_name
) ranked_products
WHERE rnk <= 5
ORDER BY profit_margin DESC;
Uses a CTE to get daily revenue, computes the 30-day rolling sum in a second CTE, then applies LAG() to that result (window functions cannot be nested inside one another).
WITH DailyRevenue AS (
SELECT CAST(sale_date AS DATE) AS sale_day, SUM(amount) AS daily_revenue
FROM sales
GROUP BY CAST(sale_date AS DATE)
),
Rolling AS (
SELECT sale_day,
SUM(daily_revenue) OVER (ORDER BY sale_day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS current_30day_revenue
FROM DailyRevenue
)
SELECT sale_day, current_30day_revenue,
LAG(current_30day_revenue, 30) OVER (ORDER BY sale_day) AS previous_30day_revenue
FROM Rolling
ORDER BY sale_day;
Uses `CASE` statement to check if the transaction time falls outside a defined business hour range (e.g., 9 AM to 5 PM).
SELECT transaction_id, transaction_timestamp,
CASE
WHEN CAST(transaction_timestamp AS TIME) < '09:00:00' OR CAST(transaction_timestamp AS TIME) >= '17:00:00'
THEN 'Outside Business Hours'
ELSE 'During Business Hours'
END AS business_hours_flag
FROM transactions;
A broadcast join hint (syntax varies by engine, e.g. `/*+ BROADCAST(lookup_table) */` in Spark SQL, or handled implicitly by optimizers for small tables) tells the optimizer to broadcast the smaller table to all nodes, avoiding a shuffle of the larger table. This is highly efficient for joining a large fact table with a small dimension table.
-- Example for SQL Server (join hints go in an OPTION clause at the end;
-- SQL Server has no explicit broadcast hint)
SELECT f.fact_column, d.dimension_column
FROM FactTable f
INNER JOIN DimensionTable d ON f.dim_key = d.dim_key
OPTION (HASH JOIN, FORCE ORDER);
-- Example for Spark SQL (using hint)
SELECT /*+ BROADCAST(d) */ f.fact_column, d.dimension_column
FROM
FactTable f
INNER JOIN
DimensionTable d ON f.dim_key = d.dim_key;
-- In most modern databases, the optimizer will automatically choose a broadcast hash join
-- if one table is small enough, so explicit hints are often not needed unless
-- you want to override default behavior or for specific edge cases.
This section covers ETL optimization, data warehouse design, and CDC implementation strategies.
SCD Type 1: Overwrite old data.
Type 2: Add new row with versioning/timestamps.
Type 3: Add new column for previous value. Use ETL tools or SQL merge/upsert logic to manage SCDs.
Star schema: Central fact table linked to denormalized dimension tables; simple, fast queries.
Snowflake schema: Dimensions are normalized into multiple related tables; reduces redundancy but more complex joins.
Include transaction-level facts (e.g., sales amount, quantity), foreign keys to dimensions (date, product, customer), and measures (discount, tax). Ensure granularity matches business needs (e.g., order line item).
This section covers Apache Spark optimization, distributed computing, and big data processing concepts.
Repartition: Increases/decreases partitions, shuffles data.
Coalesce: Decreases partitions, minimizes shuffle.
Cache: Stores RDD/DataFrame in memory (default).
Persist: Stores in memory or disk, with configurable storage levels.
Table split across nodes for parallel processing.
Types: Sharded, Replicated, Partitioned.
Catalyst optimizer, Tungsten execution, predicate pushdown, broadcast joins, partition pruning, caching.
Narrow: Data from one partition needed (e.g., map, filter).
Wide: Data shuffled across partitions (e.g., groupBy, join).
Autoscaling: Dynamically adjusts cluster resources.
Auto-Termination: Shuts down cluster after inactivity.
RDD: Low-level, type-safe, no schema, less optimized.
DataFrame: High-level, schema, optimized via Catalyst.
Columnar storage, compression, schema evolution, efficient for analytics, supports predicate pushdown.
Delta Lake: ACID transactions on data lakes, supports time travel/versioning, schema enforcement, and rollback.
Use Delta Lake's mergeSchema option to allow new columns. E.g., df.write.option('mergeSchema', 'true').format('delta').mode('append').save(path). Delta automatically tracks schema changes.
Use a unified architecture: ingest batch data via scheduled jobs and streaming data via Spark Structured Streaming. Store both in a common data store (e.g., Delta Lake) and process with the same transformation logic.
This section covers practical Python and PySpark coding scenarios commonly asked in interviews.
import pandas as pd
df = pd.read_csv('filename.csv')
print(df.head())
try:
# risky code
result = 10 / 0
except ZeroDivisionError as e:
print(f'Error: {e}')
finally:
print('Cleanup or final steps')
Use broadcast() for small DataFrames to avoid shuffles:
from pyspark.sql.functions import broadcast
df_large.join(broadcast(df_small), 'key')
# For large-large joins, ensure both are partitioned on the join key.
from pyspark.sql.window import Window
import pyspark.sql.functions as F
window = Window.partitionBy('region').orderBy(F.desc('revenue'))
df.withColumn('rank', F.row_number().over(window)) \
.filter('rank <= 3') \
.select('customer_id', 'region', 'revenue')
Partitioning: Splits data into directories based on column values, improving query performance for partitioned columns.
Bucketing: Divides data into fixed buckets using a hash function, enabling efficient joins and sampling even if the bucket column isn't in the filter.
event_hub_conf = { 'eventhubs.connectionString': '' }
df = (spark.readStream.format('eventhubs')
.options(**event_hub_conf)
.load())
df.writeStream.format('console').start().awaitTermination()
df = spark.read.format('csv').option('header', 'true').load('abfss://container@account.dfs.core.windows.net/path')
df.write.format('delta').mode('overwrite').save('/mnt/delta/table')
import pandas as pd
df = pd.read_csv('data.csv')
print(df.isnull().sum()) # missing values
print(df.describe()) # summary stats
anomalies = df[df['value'] > df['value'].mean() + 3*df['value'].std()]
print(anomalies)
from pyspark.sql.window import Window
import pyspark.sql.functions as F
window = Window.partitionBy('region').orderBy(F.desc('sales'))
df.withColumn('rank', F.rank().over(window)).show()
import pandas as pd
sales = pd.read_csv('sales.csv')
promos = pd.read_csv('promotions.csv')
merged = pd.merge(sales, promos, on='product_id')
valid = merged[merged['promotion_active'] == True]
total_sales = valid.groupby('product_id')['sales_amount'].sum()
print(total_sales)
Core Python concepts and interview questions covering data structures, memory management, and OOP principles.
Lists are mutable, meaning their elements can be changed, while tuples are immutable. Lists use square brackets [ ], tuples use parentheses ( ). Tuples are generally faster and can be used as dictionary keys if they contain only immutable elements.
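A short demonstration of all three claims: in-place mutation of a list, the TypeError a tuple raises, and a tuple serving as a dictionary key.

```python
nums_list = [1, 2, 3]
nums_tuple = (1, 2, 3)

nums_list[0] = 99           # lists are mutable
try:
    nums_tuple[0] = 99      # tuples are not
except TypeError as e:
    print("tuple rejects assignment:", e)

# A tuple of immutable elements can serve as a dict key; a list cannot.
lookup = {(1, 2): "point A"}
print(lookup[(1, 2)])  # point A
```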
Python uses automatic memory management with a built-in garbage collector that reclaims memory by reference counting and cyclic garbage collection. The gc module can be used to interact with the garbage collector.
List comprehensions provide a concise way to create lists. Example: [x*x for x in range(5)] creates [0, 1, 4, 9, 16].
A shallow copy creates a new object but does not create copies of nested objects; changes to nested objects affect both copies. A deep copy creates a new object and recursively copies all nested objects, so changes do not affect the original.
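The difference is easiest to see with a nested list: after mutating the original, the shallow copy reflects the change (the inner list is shared) while the deep copy does not.

```python
import copy

original = {"scores": [1, 2, 3]}
shallow = copy.copy(original)
deep = copy.deepcopy(original)

original["scores"].append(4)
print(shallow["scores"])  # [1, 2, 3, 4] -- nested list is shared with the original
print(deep["scores"])     # [1, 2, 3]    -- fully independent copy
```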
Use try, except blocks. Custom exceptions are created by subclassing Exception. Example: class MyError(Exception): pass and then try: ... except MyError: ...
*args allows a function to accept any number of positional arguments, while **kwargs allows for any number of keyword arguments. Useful when you don't know beforehand how many arguments will be passed.
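A minimal example: positional arguments arrive as a tuple, keyword arguments as a dict.

```python
def describe(*args, **kwargs):
    """Accepts any number of positional and keyword arguments."""
    return f"args={args}, kwargs={kwargs}"

print(describe(1, 2, name="Ravi"))  # args=(1, 2), kwargs={'name': 'Ravi'}
```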
Instance methods operate on the object instance and can access/modify object state. Class methods use @classmethod and take cls as the first argument; they can access/modify class state. Static methods use @staticmethod and don't access class or instance state.
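All three method kinds in one illustrative class (the Counter example is made up for this note):

```python
class Counter:
    total_created = 0                 # class state

    def __init__(self, start=0):
        self.value = start            # instance state
        Counter.total_created += 1

    def increment(self):              # instance method: reads/writes self
        self.value += 1
        return self.value

    @classmethod
    def created(cls):                 # class method: reads/writes cls
        return cls.total_created

    @staticmethod
    def is_valid(n):                  # static method: touches neither
        return isinstance(n, int)

c = Counter()
print(c.increment())        # 1
print(Counter.created())    # 1
print(Counter.is_valid(5))  # True
```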
Inheritance allows a class to inherit attributes and methods from another class. Example: class Animal: ... then class Dog(Animal): ...
MRO determines the order in which base classes are searched when executing a method. Python uses the C3 linearization algorithm, accessible via ClassName.__mro__.
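A classic diamond inheritance sketch showing the C3-linearized order and how it picks which `who` to call:

```python
class A:
    def who(self): return "A"

class B(A):
    def who(self): return "B"

class C(A):
    def who(self): return "C"

class D(B, C):   # diamond: C3 linearization resolves the search order
    pass

print([cls.__name__ for cls in D.__mro__])  # ['D', 'B', 'C', 'A', 'object']
print(D().who())                            # B -- first match found in the MRO
```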
Use the csv module or Pandas. Example: import csv; with open('file.csv') as f: reader = csv.reader(f) for reading. With Pandas: pd.read_csv('file.csv').
Common methods include head(), tail(), info(), describe(), groupby(), merge(), drop(), fillna(), and apply().
Use methods like dropna() to remove missing values or fillna() to replace them with a specific value or method (e.g., mean, median).
Use the merge() function for SQL-style joins or concat() for stacking DataFrames vertically or horizontally. Example: pd.merge(df1, df2, on='key').
On Linux, use cron jobs; on Windows, use Task Scheduler. Alternatively, use Python libraries like schedule or APScheduler for in-app scheduling.
__init__.py marks a directory as a Python package and can be used to execute package initialization code or set the __all__ variable.
Use the json module: import json; data = json.loads(json_string) to parse, and json.dumps(obj) to serialize.
Generators use yield to return values one at a time, maintaining state between calls, which makes them memory efficient for large datasets. Normal functions return all values at once with return.
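A small generator alongside a generator expression, showing one-at-a-time production and on-demand consumption:

```python
def countdown(n):
    while n > 0:
        yield n      # pauses here; state (n) is kept between calls
        n -= 1

gen = countdown(3)
print(next(gen))     # 3 -- values are produced one at a time
print(list(gen))     # [2, 1] -- the remaining values, on demand

# A generator expression streams values instead of materializing a list:
total = sum(x * x for x in range(5))
print(total)         # 30
```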
Decorators are functions that modify the behavior of other functions. Example: @my_decorator above a function definition.
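A logging decorator as a concrete example; `functools.wraps` keeps the wrapped function's name and docstring intact.

```python
import functools

def log_calls(func):
    @functools.wraps(func)            # preserves func.__name__, __doc__, etc.
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        print(f"{func.__name__}{args} -> {result}")
        return result
    return wrapper

@log_calls
def add(a, b):
    return a + b

print(add(2, 3))  # logs "add(2, 3) -> 5", then prints 5
```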
The GIL allows only one thread to execute Python bytecode at a time, which can limit CPU-bound multithreaded programs. For I/O-bound tasks, threading is still useful. For CPU-bound tasks, multiprocessing is recommended.
Use efficient data structures, vectorized operations with NumPy/Pandas, avoid loops where possible, use generators, and profile code to identify bottlenecks.
Use the append() method. Example: my_list.append(value)
Use the add() method. Example: my_set.add(value)
Lists are mutable (can be changed), tuples are immutable (cannot be changed). Lists use [ ], tuples use ( ).
Run python -m venv env and activate it with source env/bin/activate (Linux/Mac) or env\Scripts\activate (Windows).
REST is an architectural style using HTTP and is stateless, typically uses JSON, and is simpler. SOAP is a protocol, uses XML, is more rigid, and supports advanced features like security and transactions.
__init__ is the constructor method called when a new object is created from a class. It initializes the object's attributes.
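A minimal example (class and attribute names are illustrative):

```python
class Employee:
    def __init__(self, name, salary):
        # Runs automatically when Employee(...) is instantiated.
        self.name = name
        self.salary = salary

emp = Employee("Priya", 90000)
print(emp.name, emp.salary)  # Priya 90000
```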
Complex multi-table scenarios and advanced SQL techniques.
Uses LAG to find salary changes, calculates percentage increase, returns employee with highest increase.
WITH SalaryChanges AS (
SELECT s.EmployeeID, s.Salary, s.EffectiveDate,
LAG(s.Salary) OVER (PARTITION BY s.EmployeeID ORDER BY s.EffectiveDate) AS PrevSalary
FROM Salaries s
WHERE s.EffectiveDate >= DATEADD(YEAR, -1, GETDATE())
)
, Increases AS (
SELECT EmployeeID, (Salary - PrevSalary) AS SalaryIncrease, PrevSalary, Salary
FROM SalaryChanges
WHERE PrevSalary IS NOT NULL AND PrevSalary > 0
)
SELECT TOP 1 e.EmployeeName, d.DepartmentName,
SalaryIncrease * 100.0 / PrevSalary AS PercentageIncrease
FROM Increases i
JOIN Employees e ON i.EmployeeID = e.EmployeeID
JOIN Departments d ON e.DepartmentID = d.DepartmentID
ORDER BY PercentageIncrease DESC;
Finds login streaks using date arithmetic, counts streaks, returns top 5 users.
WITH UserLogs AS (
SELECT UserID, CAST(LogTime AS DATE) AS LogDate
FROM Logs
GROUP BY UserID, CAST(LogTime AS DATE)
), Streaks AS (
SELECT UserID, LogDate,
ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY LogDate) AS rn,
DATEADD(DAY, -ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY LogDate), LogDate) AS grp
FROM UserLogs
), ConsecutiveCounts AS (
SELECT UserID, COUNT(*) AS ConsecutiveDays
FROM Streaks
GROUP BY UserID, grp
)
SELECT TOP 5 UserID, MAX(ConsecutiveDays) AS MaxConsecutiveDays
FROM ConsecutiveCounts
GROUP BY UserID
ORDER BY MaxConsecutiveDays DESC;
LEFT JOIN ensures all users are included; COALESCE returns zero for users with no transactions.
SELECT u.UserID, COALESCE(AVG(t.Amount), 0) AS AvgTransaction
FROM Users u
LEFT JOIN Transactions t ON u.UserID = t.UserID
GROUP BY u.UserID;
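The pattern can be verified in an in-memory SQLite database (table names match the query above; the data is made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users(UserID INTEGER);
    CREATE TABLE Transactions(UserID INTEGER, Amount REAL);
    INSERT INTO Users VALUES (1), (2);
    INSERT INTO Transactions VALUES (1, 100), (1, 300);
""")
rows = conn.execute("""
    SELECT u.UserID, COALESCE(AVG(t.Amount), 0) AS AvgTransaction
    FROM Users u
    LEFT JOIN Transactions t ON u.UserID = t.UserID
    GROUP BY u.UserID
    ORDER BY u.UserID
""").fetchall()
# User 2 has no transactions: AVG is NULL, so COALESCE returns 0.
print(rows)  # [(1, 200.0), (2, 0)]
```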
Compares count of products purchased by customer to total products; returns those who bought all.
SELECT c.CustomerID, c.CustomerName
FROM Customers c
JOIN Purchases p ON c.CustomerID = p.CustomerID
GROUP BY c.CustomerID, c.CustomerName
HAVING COUNT(DISTINCT p.ProductID) = (SELECT COUNT(DISTINCT ProductID) FROM Purchases);
Essential date and time manipulation queries for business logic.
Generates a number series to enumerate each date between two dates and counts only weekdays. Replace @start_date and @end_date with your date columns or variables.
SELECT SUM(CASE WHEN DATENAME(WEEKDAY, d) NOT IN ('Saturday', 'Sunday') THEN 1 ELSE 0 END) AS weekday_count
FROM (
SELECT DATEADD(DAY, n, @start_date) AS d
FROM (SELECT TOP (DATEDIFF(DAY, @start_date, @end_date) + 1)
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
FROM master..spt_values) t
) days;
Finds the first day of the previous month by truncating to month and subtracting one.
SELECT DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()) - 1, 0) AS first_day_prev_month;
Moves to the first day of the month after next, then subtracts one day.
SELECT DATEADD(DAY, -1, DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()) + 2, 0)) AS last_day_of_next_month;
Adjusts timestamps to business hours if outside 9am-5pm. Assumes both timestamps are on the same day. For multi-day spans, logic must be extended.
SELECT
    CASE
        WHEN CAST(t1.timestamp AS TIME) < '09:00' THEN '09:00' ELSE CAST(t1.timestamp AS TIME)
    END AS adjusted_start,
    CASE
        WHEN CAST(t2.timestamp AS TIME) > '17:00' THEN '17:00' ELSE CAST(t2.timestamp AS TIME)
    END AS adjusted_end,
    DATEDIFF(HOUR,
        CASE WHEN CAST(t1.timestamp AS TIME) < '09:00' THEN DATEADD(HOUR, 9, CAST(CAST(t1.timestamp AS DATE) AS DATETIME)) ELSE t1.timestamp END,
        CASE WHEN CAST(t2.timestamp AS TIME) > '17:00' THEN DATEADD(HOUR, 17, CAST(CAST(t2.timestamp AS DATE) AS DATETIME)) ELSE t2.timestamp END
    ) AS business_hours_diff
FROM table1 t1
JOIN table2 t2 ON (join_condition);
Calculates days to add to reach the next desired weekday. Adjust @target_weekday as needed (1=Sunday, 2=Monday, ..., 7=Saturday).
DECLARE @target_weekday INT = 4; -- 1=Sunday, 4=Wednesday
SELECT DATEADD(DAY, (@target_weekday - DATEPART(WEEKDAY, GETDATE()) + 6) % 7 + 1, GETDATE()) AS next_target_weekday;
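The same modular arithmetic can be checked in Python; note that Python's weekday() numbers Monday=0 through Sunday=6, unlike T-SQL's DATEPART:

```python
from datetime import date, timedelta

def next_weekday(d, target):
    # (target - current + 6) % 7 + 1 always yields 1..7, so the result is
    # the NEXT occurrence, never today.
    days_ahead = (target - d.weekday() + 6) % 7 + 1
    return d + timedelta(days=days_ahead)

# 2024-01-10 is a Wednesday (weekday() == 2).
print(next_weekday(date(2024, 1, 10), 2))  # 2024-01-17, the following Wednesday
print(next_weekday(date(2024, 1, 10), 3))  # 2024-01-11, the next Thursday
```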
Advanced topics covering frameworks, APIs, data processing, database optimization, and DevOps practices.
Django is a full-stack web framework offering a lot of built-in features, suitable for larger applications requiring an ORM and admin interface. Flask is a micro-framework, ideal for smaller applications or services where flexibility is needed. FastAPI is designed for building APIs quickly with automatic generation of OpenAPI documentation, and is particularly useful for asynchronous applications.
The process involves defining the API endpoints and their methods (GET, POST, PUT, DELETE), ensuring data validation and error handling, and implementing authentication and authorization. Best practices include using consistent naming conventions, versioning APIs, and providing comprehensive documentation.
Pandas is used for data manipulation and analysis, providing data structures like DataFrames for handling complex datasets. NumPy is used for numerical computations, offering efficient array operations. An example might be cleaning and transforming data from CSV files using Pandas, and performing mathematical operations using NumPy.
Strategies include indexing columns that are frequently queried, optimizing query structure, using joins efficiently, and caching results where possible. Regularly analyzing query performance and adjusting based on database usage patterns is also crucial.
Deployment involves containerizing them using Docker to ensure consistency across environments, and using Jenkins for continuous integration and deployment. This process typically includes setting up automated build and test pipelines, and managing deployments on Linux servers.
Tesseract OCR is used for text extraction from images and PDFs, while OpenCV is used for pre-processing images to improve OCR accuracy. Challenges may include handling varied document formats and ensuring high accuracy in text extraction, which can be addressed through image enhancement techniques and fine-tuning OCR settings.
Comprehensive questions covering Azure Data Factory, Databricks, Synapse Analytics, and Azure ecosystem.
Define linked services (data sources), create datasets, build pipelines with activities (Copy, Data Flow, etc.), use triggers for scheduling, and monitor pipeline runs. Parameterize for reusability and handle errors with activities like Web or Stored Procedure.
Schedule Trigger: Runs at specific times.
Tumbling Window: Fixed-size, non-overlapping intervals for batch processing.
Event-based: Responds to events like file arrival. Use based on data arrival pattern and business requirements.
Azure Databricks is a managed Spark platform with collaborative notebooks. It integrates with Delta Lake for ACID transactions, schema enforcement, and time travel on data lakes. Data is stored in Azure Data Lake Storage and processed via Databricks clusters.
Use result set caching, materialized views, partitioned tables, proper distribution (hash/round robin/replicate), minimize data movement, and leverage indexes. Monitor with Query Performance Insight.
Use star/snowflake schema for sales, inventory, customer, and product data. Store raw data in Data Lake, use Synapse pipelines for ETL, and create dedicated SQL pools for analytics. Partition and distribute tables for performance.
Use RBAC and ACLs for granular access, enable encryption at rest and in transit, use private endpoints, monitor with Azure Monitor, and enable firewall rules.
Store secrets, keys, and certificates in Key Vault. Grant access via Azure AD, use managed identities for services, and reference secrets directly in ADF, Databricks, or Synapse configurations.
Register data sources, scan assets, and use Purview to track lineage from ingestion to consumption. Define business glossary, classify sensitive data, and monitor data access and movement.
Ingest streaming data with Event Hub, process/aggregate in Azure Stream Analytics, and output to Synapse Analytics for real-time dashboards and reporting.
Use watermarking and windowing to process late data, design ETL to allow reprocessing, and maintain audit logs for late arrivals. In ADF, use tumbling window triggers with late arrival tolerance.
Use watermark columns (e.g., last modified date), store high-watermark values, and filter source data for new/changed records. Use ADF's built-in incremental copy or custom logic with parameters.
Choose appropriate distribution (hash/round robin/replicate), partition large tables, use materialized views, minimize data movement, compress data, and regularly update statistics.
Ingest data from sources (ADLS, SQL, APIs) using ADF, process and transform in Databricks (cleaning, enrichment, aggregation), write results to Synapse dedicated pool, and build Power BI dashboards for reporting. Use orchestration and monitoring for reliability.
Key Power BI concepts covering data modes, security, and visualization techniques.
Import: Loads data into Power BI for fast, in-memory analysis but is static and needs refreshing.
Direct Query: Leaves data in the source, enabling real-time queries but may be slower and limited by source performance.
For very large datasets, use Direct Query to avoid memory constraints, but optimize source performance.
Slicers: Are visual controls that let users filter data interactively across multiple visuals.
Visual-level filters: Apply only to a single visual.
Slicers provide a dashboard-wide filtering experience, improving interactivity and user-driven analysis.
Define roles and DAX filters in Power BI Desktop (e.g., [Country] = USERPRINCIPALNAME()), publish to the Power BI Service, and assign users/groups to roles. RLS ensures users see only the data relevant to them, enforcing data privacy.
Paginated reports are pixel-perfect, printable reports ideal for multi-page outputs like invoices or billing statements. They allow precise control over layout and are used when detailed, printable documents are required.
Analytical and estimation questions to test problem-solving and structured thinking.
Break down by regions, estimate active users, average bookings per user, and seasonality. Use public data (market share, travel trends), adjust for weekends/holidays, and validate with industry reports. Show step-by-step logic and state assumptions clearly.
Estimate total site traffic, % interested in flights, average searches per user, and repeat users. Use funnel analysis: total visitors → % flight searchers → unique users. State assumptions and adjust for mobile/web split.
Real-world scenario-based questions to assess analytical and problem-solving skills.
Analyze funnel data (searches, clicks, bookings), segment by device, channel, and user type. Check for UI changes, pricing issues, inventory problems, or external events. Propose A/B tests, user surveys, and cross-team collaboration to address findings.
Track adoption rate (feature usage), conversion rate, average booking value, customer retention, and feedback. Compare with single-city bookings, monitor technical performance, and analyze user cohorts for repeat usage and upsell/cross-sell impact.
Questions about your personal project experience and contributions.
Briefly describe the project domain, tech stack, business goal, and your roles (e.g., ETL design, data modeling, performance tuning, automation, etc.).
Advanced SQL queries covering window functions, aggregations, and complex scenarios.
SELECT department_id, salary
FROM (
SELECT department_id, salary,
ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rn
FROM employees
) ranked
WHERE rn = 2;
-- DENSE_RANK() can be used if you want to handle ties.
SELECT user_id, transaction_date, COUNT(*) AS total_transactions
FROM transactions
GROUP BY user_id, transaction_date;
SELECT p.project_id, p.project_name, (p.budget / COUNT(e.employee_id)) AS budget_per_employee
FROM projects p
JOIN employees e ON p.project_id = e.project_id
GROUP BY p.project_id, p.project_name, p.budget
ORDER BY budget_per_employee DESC
LIMIT 1;
Finds the maximum salary less than the overall max, giving the second-highest.
SELECT MAX(salary)
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
Self-join employees to their managers and compare salaries.
SELECT e1.*
FROM employees e1
JOIN employees e2 ON e1.manager_id = e2.id
WHERE e1.salary > e2.salary;
Uses EXISTS to find duplicates based on columns, avoiding GROUP BY.
SELECT t1.*
FROM table_name t1
WHERE EXISTS (
SELECT 1 FROM table_name t2
WHERE t1.column1 = t2.column1 AND t1.column2 = t2.column2 AND t1.rowid <> t2.rowid
);
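SQLite also exposes an implicit rowid, so the EXISTS pattern can be demonstrated as-is in memory (sample data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_name(column1 TEXT, column2 TEXT);
    INSERT INTO table_name VALUES ('a', 'x'), ('a', 'x'), ('b', 'y');
""")
dupes = conn.execute("""
    SELECT t1.rowid, t1.column1, t1.column2
    FROM table_name t1
    WHERE EXISTS (
        SELECT 1 FROM table_name t2
        WHERE t1.column1 = t2.column1
          AND t1.column2 = t2.column2
          AND t1.rowid <> t2.rowid
    )
    ORDER BY t1.rowid
""").fetchall()
# Both copies of ('a', 'x') are flagged; the unique ('b', 'y') row is not.
print(dupes)
```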
NTILE(10) splits data into 10 buckets; decile 1 is top 10%.
SELECT *
FROM (
SELECT *, NTILE(10) OVER (ORDER BY salary DESC) AS decile
FROM employees
) ranked
WHERE decile = 1;
SUM() OVER (ORDER BY ...) computes running totals.
SELECT *, SUM(column) OVER (ORDER BY id) AS running_total
FROM table_name;
Finds employees whose IDs are not in the leave records.
SELECT *
FROM employees
WHERE id NOT IN (SELECT employee_id FROM leaves);
LEAD() gets the next row's value for difference calculation.
SELECT *, column - LEAD(column) OVER (ORDER BY id) AS diff_with_next
FROM table_name;
GROUP BY and HAVING find departments with multiple employees.
SELECT department
FROM employees
GROUP BY department
HAVING COUNT(*) > 1;
Window function MAX() OVER (PARTITION BY ...) finds group max per row.
SELECT *
FROM (
SELECT *, MAX(column) OVER (PARTITION BY group_column) AS max_in_group
FROM table_name
) t
WHERE column = max_in_group;
Group by employee and month, then filter for those with >3 leaves.
SELECT employee_id, EXTRACT(YEAR FROM leave_date) AS year, EXTRACT(MONTH FROM leave_date) AS month, COUNT(*) AS leave_count
FROM leaves
GROUP BY employee_id, year, month
HAVING COUNT(*) > 3;
Real-world SQL scenarios testing analytical and problem-solving abilities.
Use ORDER BY with LIMIT for global top N; use ROW_NUMBER() OVER (PARTITION BY ...) for top N within groups.
-- Top N products overall
SELECT product_id, SUM(sales) AS total_sales
FROM sales
GROUP BY product_id
ORDER BY total_sales DESC
LIMIT N;
-- Top N products within each category
SELECT *
FROM (
SELECT product_id, category_id, SUM(sales) AS total_sales,
ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY SUM(sales) DESC) AS rn
FROM sales
GROUP BY product_id, category_id
) ranked
WHERE rn <= N;
Use LAG() for YOY or month-over-month comparisons. Aggregate sales by time period and group.
-- YOY growth for each category
SELECT category_id, year, sales,
sales - LAG(sales) OVER (PARTITION BY category_id ORDER BY year) AS yoy_growth
FROM (
SELECT category_id, EXTRACT(YEAR FROM sale_date) AS year, SUM(sales) AS sales
FROM sales
GROUP BY category_id, year
) yearly;
-- Products with higher sales than previous month
SELECT product_id, month, sales
FROM (
SELECT product_id, EXTRACT(MONTH FROM sale_date) AS month, SUM(sales) AS sales,
       LAG(SUM(sales)) OVER (PARTITION BY product_id ORDER BY month) AS prev_month_sales
FROM sales
GROUP BY product_id, month
) t
WHERE sales > prev_month_sales;
Use SUM() OVER with appropriate window frames for running or rolling totals.
-- Running total sales per product
SELECT product_id, month, SUM(sales) OVER (PARTITION BY product_id ORDER BY month) AS running_total
FROM sales;
-- Rolling 3-month sales per category
SELECT category_id, month, SUM(sales) OVER (
PARTITION BY category_id ORDER BY month
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS rolling_3_month_sales
FROM sales;
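Both frames can be demonstrated in SQLite (version 3.25+ supports window functions; the sample data is made up). Note how the rolling sum diverges from the running total once more than three rows are in scope:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales(product_id INTEGER, month INTEGER, sales INTEGER);
    INSERT INTO sales VALUES (1, 1, 10), (1, 2, 20), (1, 3, 5), (1, 4, 40);
""")
rows = conn.execute("""
    SELECT month,
           SUM(sales) OVER (ORDER BY month) AS running_total,
           SUM(sales) OVER (ORDER BY month
                            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rolling_3
    FROM sales
    ORDER BY month
""").fetchall()
# Month 4: running total = 75, but rolling 3-month window = 20 + 5 + 40 = 65.
print(rows)  # [(1, 10, 10), (2, 30, 30), (3, 35, 35), (4, 75, 65)]
```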
Use CASE WHEN inside aggregate functions to pivot data from rows to columns.
SELECT category_id,
SUM(CASE WHEN year = 2021 THEN sales ELSE 0 END) AS sales_2021,
SUM(CASE WHEN year = 2022 THEN sales ELSE 0 END) AS sales_2022
FROM (
SELECT category_id, EXTRACT(YEAR FROM sale_date) AS year, sales
FROM sales
) t
GROUP BY category_id;
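The conditional-aggregation pivot runs unchanged in SQLite (sample rows are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales(category_id INTEGER, year INTEGER, sales INTEGER);
    INSERT INTO sales VALUES (1, 2021, 100), (1, 2022, 150), (2, 2021, 80);
""")
rows = conn.execute("""
    SELECT category_id,
           SUM(CASE WHEN year = 2021 THEN sales ELSE 0 END) AS sales_2021,
           SUM(CASE WHEN year = 2022 THEN sales ELSE 0 END) AS sales_2022
    FROM sales
    GROUP BY category_id
    ORDER BY category_id
""").fetchall()
# Each year becomes a column; category 2 has no 2022 sales, so it pivots to 0.
print(rows)  # [(1, 100, 150), (2, 80, 0)]
```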
Different joins affect row counts based on matching and non-matching records. Useful for data quality checks and understanding data relationships.
-- INNER JOIN
SELECT COUNT(*) FROM a INNER JOIN b ON a.id = b.id;
-- LEFT JOIN
SELECT COUNT(*) FROM a LEFT JOIN b ON a.id = b.id;
-- RIGHT JOIN
SELECT COUNT(*) FROM a RIGHT JOIN b ON a.id = b.id;
-- FULL OUTER JOIN
SELECT COUNT(*) FROM a FULL OUTER JOIN b ON a.id = b.id;
Joins Orders table to itself to find customers with orders on consecutive days.
SELECT DISTINCT o1.CustomerID
FROM Orders o1
JOIN Orders o2
ON o1.CustomerID = o2.CustomerID
AND DATEDIFF(DAY, o1.OrderDate, o2.OrderDate) = 1;
Essential SQL optimization techniques and coding standards for efficient query writing.
EXISTS is often more efficient, especially with correlated subqueries, as it stops searching once a match is found.
Improves readability and avoids ambiguity, especially in complex queries.
GROUP BY is more flexible and can be optimized better by query optimizers.
Comments help others (and your future self) understand non-trivial logic, but excessive comments can clutter code.
Joins are typically more efficient and easier for query optimizers to handle than nested subqueries.
WHERE filters rows before aggregation, improving performance. HAVING should be used only for aggregate conditions.
Leading wildcards prevent index usage, causing full table scans.
Grouping by columns with higher cardinality first can improve query performance by reducing intermediate result size.
Capitalizing SQL keywords (SELECT, FROM, WHERE) enhances SQL statement readability and makes SQL statements easier to scan visually.
Improves performance, reduces unnecessary data transfer, and makes breaking changes to schemas easier to detect.
CTEs (WITH clauses) make queries easier to read, maintain, and debug.
Increases readability and reduces risk of accidental cross joins or logic errors.
Increases runtime without benefit.
UNION ALL is faster because it skips duplicate elimination. Use UNION only when necessary.
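The behavioral difference is easy to see in SQLite (literal values used for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# UNION deduplicates (extra sort/dedup work); UNION ALL keeps every row.
union = conn.execute(
    "SELECT 1 AS v UNION SELECT 1 UNION SELECT 2"
).fetchall()
union_all = conn.execute(
    "SELECT 1 AS v UNION ALL SELECT 1 UNION ALL SELECT 2"
).fetchall()
print(sorted(union))      # [(1,), (2,)]       duplicates removed
print(sorted(union_all))  # [(1,), (1,), (2,)] duplicates kept
```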
Allows easy commenting/uncommenting of conditions during query development.
Prevents unexpected results or errors, as NULL comparisons behave differently.
Improves performance and reduces resource usage.
Indexed joins are faster and more efficient, reducing query execution time.