Data Engineering Interview Questions & Solutions

SQL Interview Questions

Comprehensive SQL questions covering window functions, joins, aggregations, and query optimization.

1. Find the top 3 cities with the highest sales per month

Sample Data/Table:

sale_id city sale_date amount
1 Mumbai 2024-01-10 5000
2 Delhi 2024-01-15 7000
3 Bangalore 2024-01-20 10000
4 Chennai 2024-02-05 3000
5 Mumbai 2024-02-08 9000

SQL Solution / Explanation:

SELECT sale_month, city, total_sales
FROM (
    SELECT
        DATE_FORMAT(sale_date, '%Y-%m') AS sale_month,
        city,
        SUM(amount) AS total_sales,
        ROW_NUMBER() OVER (PARTITION BY DATE_FORMAT(sale_date, '%Y-%m') ORDER BY SUM(amount) DESC) AS rn
    FROM sales_table
    GROUP BY sale_month, city
) ranked
WHERE rn <= 3;

Explanation: This query uses window functions to rank cities by total sales within each month. The ROW_NUMBER() function assigns ranks, and we filter for the top 3 cities per month using WHERE rn <= 3.
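As a sanity check, the same logic can be run end-to-end with Python's built-in sqlite3 module. This is a sketch: SQLite's strftime('%Y-%m', ...) stands in for MySQL's DATE_FORMAT, and the monthly totals are pre-aggregated in a CTE so the window function ranks plain columns.

```python
import sqlite3

# Hypothetical in-memory copy of the sample sales_table from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_table (sale_id INT, city TEXT, sale_date TEXT, amount INT);
INSERT INTO sales_table VALUES
  (1, 'Mumbai',    '2024-01-10',  5000),
  (2, 'Delhi',     '2024-01-15',  7000),
  (3, 'Bangalore', '2024-01-20', 10000),
  (4, 'Chennai',   '2024-02-05',  3000),
  (5, 'Mumbai',    '2024-02-08',  9000);
""")
rows = conn.execute("""
WITH monthly AS (
    SELECT strftime('%Y-%m', sale_date) AS sale_month, city,
           SUM(amount) AS total_sales
    FROM sales_table
    GROUP BY sale_month, city
)
SELECT sale_month, city, total_sales
FROM (
    SELECT sale_month, city, total_sales,
           ROW_NUMBER() OVER (PARTITION BY sale_month
                              ORDER BY total_sales DESC) AS rn
    FROM monthly
) ranked
WHERE rn <= 3
ORDER BY sale_month, total_sales DESC;
""").fetchall()
print(rows)
# January ranks Bangalore > Delhi > Mumbai; February ranks Mumbai > Chennai
```

On the sample data, each month has at most three cities, so every row survives the rn <= 3 filter; the ranking matters once a month has four or more cities.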

2. Write an SQL query to calculate the running total of sales for each city

Sample Data/Table:

sale_id city sale_date amount
1 Mumbai 2024-01-10 5000
2 Delhi 2024-01-15 7000
3 Mumbai 2024-01-20 3000
4 Delhi 2024-02-05 6000
5 Mumbai 2024-02-08 8000

SQL Solution / Explanation:

SELECT
    sale_id,
    city,
    sale_date,
    amount,
    SUM(amount) OVER (PARTITION BY city ORDER BY sale_date) AS running_total
FROM sales_data
ORDER BY city, sale_date;

Explanation: The SUM(amount) OVER (PARTITION BY city ORDER BY sale_date) creates a running total for each city. The window function accumulates the sum ordered by sale date within each city partition.
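The running total can be verified against the sample rows with sqlite3 (a sketch; the query is unchanged apart from dropping the sale_date column from the output):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_data (sale_id INT, city TEXT, sale_date TEXT, amount INT);
INSERT INTO sales_data VALUES
  (1, 'Mumbai', '2024-01-10', 5000),
  (2, 'Delhi',  '2024-01-15', 7000),
  (3, 'Mumbai', '2024-01-20', 3000),
  (4, 'Delhi',  '2024-02-05', 6000),
  (5, 'Mumbai', '2024-02-08', 8000);
""")
rows = conn.execute("""
SELECT sale_id, city, amount,
       SUM(amount) OVER (PARTITION BY city ORDER BY sale_date) AS running_total
FROM sales_data
ORDER BY city, sale_date;
""").fetchall()
print(rows)
# Delhi accumulates 7000 -> 13000; Mumbai accumulates 5000 -> 8000 -> 16000
```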

3. Find the second highest salary of employees

Sample Data/Table:

emp_id emp_name salary department
1 Ravi 70000 HR
2 Priya 90000 IT
3 Kunal 85000 Finance
4 Aisha 60000 IT
5 Rahul 95000 HR

SQL Solution / Explanation:

SELECT MAX(salary) AS second_highest_salary
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);

Explanation: This query finds the maximum salary that is less than the overall maximum salary. The subquery finds the highest salary, and the outer query finds the maximum of all remaining salaries.
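A quick sqlite3 sketch confirms the MAX-below-MAX pattern on the sample employees table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (emp_id INT, emp_name TEXT, salary INT, department TEXT);
INSERT INTO employees VALUES
  (1, 'Ravi',  70000, 'HR'),      (2, 'Priya', 90000, 'IT'),
  (3, 'Kunal', 85000, 'Finance'), (4, 'Aisha', 60000, 'IT'),
  (5, 'Rahul', 95000, 'HR');
""")
second = conn.execute("""
SELECT MAX(salary) FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
""").fetchone()[0]
print(second)  # 90000, since the overall maximum is 95000
```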

4. Find employees who have the same salary as someone in the same department

Sample Data/Table:

emp_id emp_name salary department
1 Neha 50000 HR
2 Ravi 70000 IT
3 Aman 50000 HR
4 Pooja 90000 IT
5 Karan 70000 IT

SQL Solution / Explanation:

SELECT *
FROM employee_salary e1
WHERE EXISTS (
    SELECT 1
    FROM employee_salary e2
    WHERE e1.department = e2.department
      AND e1.salary = e2.salary
      AND e1.emp_id <> e2.emp_id
);

Explanation: This query uses a correlated subquery with EXISTS to find employees who have at least one other employee in the same department with the same salary. The condition e1.emp_id <> e2.emp_id ensures we don't match an employee with themselves.

5. Write an SQL query to find duplicate records in a table

Sample Data/Table:

user_id user_name email
1 Sameer sameer@gmail.com
2 Anjali anjali@gmail.com
3 Sameer sameer@gmail.com
4 Rohan rohan@gmail.com
5 Rohan rohan@gmail.com

SQL Solution / Explanation:

SELECT user_name, email, COUNT(*) AS duplicate_count
FROM users
GROUP BY user_name, email
HAVING COUNT(*) > 1;

Explanation: This query groups records by user_name and email, then uses HAVING COUNT(*) > 1 to filter only the groups that have more than one occurrence, effectively finding duplicates.

6. Write an SQL query to delete duplicate rows while keeping only one unique record

Sample Data/Table:

user_id user_name email
1 Sameer sameer@gmail.com
2 Anjali anjali@gmail.com
3 Sameer sameer@gmail.com
4 Rohan rohan@gmail.com
5 Rohan rohan@gmail.com

SQL Solution / Explanation:

DELETE FROM users
WHERE user_id NOT IN (
    SELECT keep_id FROM (
        SELECT MIN(user_id) AS keep_id
        FROM users
        GROUP BY user_name, email
    ) AS keepers
);

Explanation: This query keeps the record with the minimum user_id for each unique combination of user_name and email, and deletes all other duplicate records. The inner subquery is wrapped in a derived table (keepers) because MySQL does not allow a DELETE to reference the table it is deleting from in a direct subquery.
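The dedupe can be exercised with sqlite3 (a sketch: SQLite, unlike MySQL, accepts the direct self-referencing subquery, so no derived-table wrapper is needed here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INT, user_name TEXT, email TEXT);
INSERT INTO users VALUES
  (1, 'Sameer', 'sameer@gmail.com'), (2, 'Anjali', 'anjali@gmail.com'),
  (3, 'Sameer', 'sameer@gmail.com'), (4, 'Rohan',  'rohan@gmail.com'),
  (5, 'Rohan',  'rohan@gmail.com');
-- Keep the lowest user_id per (user_name, email), delete the rest.
DELETE FROM users
WHERE user_id NOT IN (
    SELECT MIN(user_id) FROM users GROUP BY user_name, email
);
""")
remaining = conn.execute("SELECT user_id FROM users ORDER BY user_id").fetchall()
print(remaining)  # [(1,), (2,), (4,)]
```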

7. Write an SQL query to pivot a table by months

Sample Data/Table:

sale_id city sale_date amount
1 Mumbai 2024-01-10 5000
2 Delhi 2024-02-15 7000
3 Mumbai 2024-01-20 3000
4 Delhi 2024-03-05 6000
5 Mumbai 2024-02-08 8000

SQL Solution / Explanation:

SELECT
    city,
    SUM(CASE WHEN DATE_FORMAT(sale_date, '%Y-%m') = '2024-01' THEN amount ELSE 0 END) AS Jan_2024,
    SUM(CASE WHEN DATE_FORMAT(sale_date, '%Y-%m') = '2024-02' THEN amount ELSE 0 END) AS Feb_2024,
    SUM(CASE WHEN DATE_FORMAT(sale_date, '%Y-%m') = '2024-03' THEN amount ELSE 0 END) AS Mar_2024
FROM sales_data
GROUP BY city;

Explanation: This query uses conditional aggregation with CASE statements to pivot the data. Each month becomes a separate column, and the sum of amounts for that month is calculated for each city.
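The pivot can be checked with sqlite3 (a sketch, with strftime in place of MySQL's DATE_FORMAT):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_data (sale_id INT, city TEXT, sale_date TEXT, amount INT);
INSERT INTO sales_data VALUES
  (1, 'Mumbai', '2024-01-10', 5000),
  (2, 'Delhi',  '2024-02-15', 7000),
  (3, 'Mumbai', '2024-01-20', 3000),
  (4, 'Delhi',  '2024-03-05', 6000),
  (5, 'Mumbai', '2024-02-08', 8000);
""")
rows = conn.execute("""
SELECT city,
       SUM(CASE WHEN strftime('%Y-%m', sale_date) = '2024-01' THEN amount ELSE 0 END) AS Jan_2024,
       SUM(CASE WHEN strftime('%Y-%m', sale_date) = '2024-02' THEN amount ELSE 0 END) AS Feb_2024,
       SUM(CASE WHEN strftime('%Y-%m', sale_date) = '2024-03' THEN amount ELSE 0 END) AS Mar_2024
FROM sales_data
GROUP BY city
ORDER BY city;
""").fetchall()
print(rows)  # [('Delhi', 0, 7000, 6000), ('Mumbai', 8000, 8000, 0)]
```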

8. Find customers who placed at least 3 orders in the last 6 months

Sample Data/Table:

order_id customer_id order_date amount
1 101 2024-01-10 1000
2 102 2024-02-15 2000
3 101 2024-03-20 1500
4 103 2024-04-05 2500
5 101 2024-05-08 3000

SQL Solution / Explanation:

SELECT customer_id
FROM orders
WHERE order_date >= DATE_SUB(CURDATE(), INTERVAL 6 MONTH)
GROUP BY customer_id
HAVING COUNT(*) >= 3;

Explanation: This query filters orders from the last 6 months using DATE_SUB(CURDATE(), INTERVAL 6 MONTH), groups by customer, and uses HAVING COUNT(*) >= 3 to find customers with at least 3 orders in that period.

9. Normalization vs. Denormalization – What are they, and when should each be used in a data pipeline?

Explanation:

Normalization

Breaking tables into smaller, related tables to reduce redundancy and improve data integrity. Best suited to OLTP systems with frequent updates, where data consistency matters most.

Denormalization

Combining tables to reduce joins and improve read performance. Best suited to OLAP systems, reporting, and analytics.

10. Indexing in SQL – Explain clustered vs. non-clustered indexes. How do they impact query performance?

Explanation:

Clustered Index

Determines the physical order of rows in the table; only one per table; fast for range queries.

Non-Clustered Index

A separate structure that points back to the data; a table can have many; fast for point lookups. Any index speeds up SELECTs but slows down writes, since every INSERT/UPDATE/DELETE must also maintain the index.

11. Write an SQL query to find the second highest salary from an employee table

SQL Solution / Explanation:

SELECT MAX(salary) AS second_highest_salary
FROM employee
WHERE salary < (SELECT MAX(salary) FROM employee);
-- Alternatively, use ROW_NUMBER() or DENSE_RANK() for ties.

Explanation: Finds the maximum salary that is less than the overall maximum. Alternative methods include using ROW_NUMBER() or DENSE_RANK() window functions for handling ties.

12. How do you handle NULL values in SQL joins?

Explanation:

Use IS NULL or COALESCE() to manage NULLs. In joins, NULLs can lead to unmatched rows. For example, LEFT JOIN returns NULLs for missing matches. Use COALESCE(column, 'default') to replace NULLs.

13. Extract pipeline name, current month name, number of failures, and identify the maximum failures in the current month

SQL Solution / Explanation:

SELECT pipeline_name, MONTHNAME(run_date) AS month, COUNT(*) AS failures,
       MAX(COUNT(*)) OVER () AS max_failures
FROM pipeline_log
WHERE status='FAILED' AND MONTH(run_date)=MONTH(CURDATE())
GROUP BY pipeline_name, MONTHNAME(run_date);

Explanation: Uses MONTHNAME() to get the month name, counts failures per pipeline, and uses a window function to find the maximum failures across all pipelines.

14. Develop a code to generate the following output based on the provided input

SQL Solution / Explanation:

SELECT id,
       AVG(CASE WHEN item_name='Apple' THEN value END) AS Apple,
       AVG(CASE WHEN item_name='Orange' THEN value END) AS Orange,
       AVG(CASE WHEN item_name='Banana' THEN value END) AS Banana
FROM input_table
GROUP BY id;

Explanation: Uses conditional aggregation with CASE statements to pivot data, creating separate columns for each item type.

15. Identify the 3rd highest sales amount in each region based on the saleid, product, region, and salesamount data

SQL Solution / Explanation:

SELECT region, salesamount
FROM (
    SELECT region, salesamount,
           ROW_NUMBER() OVER (PARTITION BY region ORDER BY salesamount DESC) AS rn
    FROM sales
) t
WHERE rn = 3;

Explanation: Uses ROW_NUMBER() window function partitioned by region and ordered by salesamount descending, then filters for the 3rd row in each partition.

16. Determine the left, right, and inner outputs from the given dataset

Explanation:

Left Join: all rows from A, plus matching rows from B (NULLs where B has no match).

Right Join: all rows from B, plus matching rows from A (NULLs where A has no match).

Inner Join: only the rows whose join keys match in both A and B.

17. Write an SQL query to calculate the customer churn rate over the last 6 months

SQL Solution / Explanation:

SELECT COUNT(DISTINCT customer_id) AS churned_customers,
       (SELECT COUNT(DISTINCT customer_id) FROM customers
        WHERE signup_date BETWEEN DATEADD(month, -6, GETDATE()) AND GETDATE()) AS total_customers,
       COUNT(DISTINCT customer_id)*1.0 /
       (SELECT COUNT(DISTINCT customer_id) FROM customers
        WHERE signup_date BETWEEN DATEADD(month, -6, GETDATE()) AND GETDATE()) AS churn_rate
FROM customers
WHERE last_active_date < DATEADD(month, -6, GETDATE());

Explanation: Calculates churn rate by dividing churned customers (inactive for 6+ months) by total customers who signed up in the last 6 months.

18. Calculate the cancellation rate for each room type over the last 6 months, considering only bookings of minimum stay of 2 nights

Detailed Interview Response:

SELECT room_type,
       COUNT(CASE WHEN status='cancelled' THEN 1 END)*1.0/COUNT(*) AS cancellation_rate
FROM bookings
WHERE stay_nights >= 2
  AND booking_date >= DATEADD(month, -6, CURRENT_DATE)
GROUP BY room_type;

Explanation: Filters bookings with 2+ night stays from the last 6 months, then calculates the proportion of cancelled bookings per room type.

19. Determine the average conversion rate (confirmed bookings vs. search events) for users grouped by their country and device type

Detailed Interview Response:

SELECT s.country, s.device_type,
       AVG(CASE WHEN b.booking_id IS NOT NULL THEN 1.0 ELSE 0 END) AS avg_conversion_rate
FROM searches s
LEFT JOIN bookings b ON s.user_id = b.user_id AND s.session_id = b.session_id
GROUP BY s.country, s.device_type;

Explanation: Joins search events with confirmed bookings, then calculates the conversion rate as the proportion of searches that resulted in bookings, grouped by country and device.

20. Identify properties that have consistently underperformed compared to the average booking rate of their region over the last 12 months

Detailed Interview Response:

WITH region_avg AS (
    SELECT region, AVG(booking_rate) AS avg_rate
    FROM properties
    WHERE booking_date >= DATEADD(month, -12, CURRENT_DATE)
    GROUP BY region
)
SELECT p.property_id
FROM properties p
JOIN region_avg r ON p.region = r.region
WHERE p.booking_rate < r.avg_rate
  AND p.booking_date >= DATEADD(month, -12, CURRENT_DATE);

Explanation: Uses a CTE to calculate regional average booking rates, then identifies properties performing below their region's average over the past year.

21. Detect instances of demand surge where the number of bookings in an hour exceeds the hourly average by more than 50%

Detailed Interview Response:

WITH hourly AS (
    SELECT DATE_TRUNC('hour', booking_time) AS hour, COUNT(*) AS bookings
    FROM bookings
    GROUP BY hour
),
avg_hourly AS (
    SELECT AVG(bookings) AS avg_bookings FROM hourly
)
SELECT h.hour, h.bookings
FROM hourly h CROSS JOIN avg_hourly a
WHERE h.bookings > 1.5 * a.avg_bookings;

Explanation: Calculates hourly booking counts, compares each hour to the overall hourly average, and identifies hours exceeding 150% of the average.

22. What challenges might arise when querying sharded databases, especially for calculating global metrics like average booking rates?

Detailed Interview Response:

Data distribution across shards makes cross-shard aggregation complex. Challenges include data consistency, network latency, partial failures, and the need for distributed queries or data pipelines to aggregate metrics globally.

23. Explain how you would handle booking timestamps originating from different time zones when querying for global daily booking patterns

Detailed Interview Response:

Normalize all timestamps to a standard (e.g., UTC) during ingestion or in queries. Store original and converted timestamps if local analysis is needed. Use timezone-aware functions to aggregate by global day.

24. How would you balance normalization for data integrity and denormalization for query performance?

Detailed Interview Response:

Normalize for OLTP systems to reduce redundancy and ensure integrity. Denormalize in OLAP/data warehouses for faster analytics. Use views or materialized views to combine normalized data for reporting.

25. If two systems simultaneously update the same booking record, what mechanisms would you use in SQL to prevent data conflicts and ensure consistency?

Detailed Interview Response:

Use transactions with isolation levels (e.g., SERIALIZABLE), optimistic concurrency control (row versioning/timestamps), or database locking mechanisms to prevent lost updates and ensure consistency.

26. Explain the scenarios where window functions outperform traditional group-by clauses in SQL

Detailed Interview Response:

Window functions allow calculations (e.g., running totals, ranks) across rows without collapsing them, enabling complex analytics (like moving averages or percentiles) that GROUP BY cannot provide without subqueries.

27. Find all employees who earn more than the average salary

Explanation:

Compares each employee's salary to the overall average.

Sample SQL Query:

SELECT * FROM employees WHERE salary > (SELECT AVG(salary) FROM employees);

28. Retrieve names of employees who work in the same department as 'John'

Explanation:

Finds employees in John's department, excluding John.

Sample SQL Query:

SELECT name FROM employees WHERE department_id = (SELECT department_id FROM employees WHERE name = 'John') AND name <> 'John';

29. Display the second highest salary from the Employee table

Explanation:

Gets the highest salary less than the maximum.

Sample SQL Query:

SELECT MAX(salary) FROM employees WHERE salary < (SELECT MAX(salary) FROM employees);

30. Find all customers who have made more than five orders

Explanation:

Groups orders by customer and filters for those with >5.

Sample SQL Query:

SELECT customer_id FROM orders GROUP BY customer_id HAVING COUNT(*) > 5;

31. Count the number of orders placed by each customer

Explanation:

Simple aggregation by customer.

Sample SQL Query:

SELECT customer_id, COUNT(*) AS order_count FROM orders GROUP BY customer_id;

32. Retrieve employees who joined in the last 6 months

Explanation:

Filters by join date within last 6 months.

Sample SQL Query:

SELECT * FROM employees WHERE join_date >= DATEADD(MONTH, -6, GETDATE());

33. Find the total sales amount per product

Explanation:

Aggregates sales by product.

Sample SQL Query:

SELECT product_id, SUM(sales_amount) AS total_sales FROM sales GROUP BY product_id;

34. List all products that have never been sold

Explanation:

Finds products with no matching sales.

Sample SQL Query:

SELECT p.product_id FROM products p LEFT JOIN sales s ON p.product_id = s.product_id WHERE s.product_id IS NULL;

35. Update salary of employees based on performance rating

Explanation:

Example: 10% raise for 'Excellent' performers. Adjust logic as needed.

Sample SQL Query:

UPDATE employees SET salary = salary * 1.10 WHERE performance_rating = 'Excellent';

36. Delete duplicate rows from a table

Explanation:

Keeps the first occurrence, deletes others. Replace col1, col2, ... with relevant columns.

Sample SQL Query:

WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY col1, col2, ... ORDER BY (SELECT NULL)) AS rn
    FROM table_name
)
DELETE FROM cte WHERE rn > 1;

37. Find the second highest salary without using MAX twice

Explanation:

Uses window function for ranking.

Sample SQL Query:

SELECT salary
FROM (
    SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk FROM employees
) t
WHERE rnk = 2;

38. Top 10 customers who have not placed an order in the last year

Explanation:

Finds customers with no orders in the last year.

Sample SQL Query:

SELECT TOP 10 c.customer_id, c.name
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id AND o.order_date >= DATEADD(YEAR, -1, GETDATE())
WHERE o.order_id IS NULL
ORDER BY c.customer_id;
-- (SQL Server syntax; in MySQL/PostgreSQL, drop TOP 10 and append LIMIT 10 instead.)

39. Compute year-over-year growth rate of revenue for each product category

Explanation:

Uses LAG() for previous year's revenue and calculates growth.

Sample SQL Query:

SELECT category_id, year, revenue,
       (revenue - LAG(revenue) OVER (PARTITION BY category_id ORDER BY year)) / NULLIF(LAG(revenue) OVER (PARTITION BY category_id ORDER BY year), 0) AS growth_rate
FROM (
    SELECT category_id, EXTRACT(YEAR FROM sale_date) AS year, SUM(revenue) AS revenue
    FROM sales
    GROUP BY category_id, year
) t;

40. Join three tables and show records that exist in exactly two of the tables

Explanation:

Counts presence in each table, filters for exactly two.

Sample SQL Query:

SELECT id
FROM (
    SELECT id,
           (CASE WHEN a.id IS NOT NULL THEN 1 ELSE 0 END +
            CASE WHEN b.id IS NOT NULL THEN 1 ELSE 0 END +
            CASE WHEN c.id IS NOT NULL THEN 1 ELSE 0 END) AS cnt
    FROM a FULL OUTER JOIN b ON a.id = b.id
    FULL OUTER JOIN c ON a.id = c.id OR b.id = c.id
) t
WHERE cnt = 2;

41. Find the median sales amount for each region

Explanation:

Uses the ordered-set aggregate PERCENTILE_CONT (PostgreSQL, Oracle). In SQL Server, PERCENTILE_CONT is a window function rather than an aggregate, so the per-region median must be written with OVER (PARTITION BY region) and then de-duplicated.

Sample SQL Query:

SELECT region,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sales_amount) AS median_sales
FROM sales
GROUP BY region;

42. Calculate the retention rate of customers over a given time period

Explanation:

Calculates the proportion of customers who made repeat purchases after their first order.

Sample SQL Query:

WITH cohort AS (
    SELECT customer_id, MIN(order_date) AS first_order
    FROM orders
    GROUP BY customer_id
), retained AS (
    SELECT c.customer_id
    FROM cohort c
    JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.order_date >= DATEADD(MONTH, 1, c.first_order)
)
SELECT COUNT(*) * 1.0 / (SELECT COUNT(*) FROM cohort) AS retention_rate FROM retained;
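The cohort/retained logic can be sketched with sqlite3 on made-up orders; date(..., '+1 month') replaces the T-SQL DATEADD, and DISTINCT guards against counting a customer twice:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer_id INT, order_date TEXT);
INSERT INTO orders VALUES
  (1, '2024-01-05'), (1, '2024-03-10'),  -- repeat buyer, a month+ later
  (2, '2024-02-01'),                     -- one-time buyer
  (3, '2024-01-20'), (3, '2024-01-25');  -- repeats, but within the first month
""")
rate = conn.execute("""
WITH cohort AS (
    SELECT customer_id, MIN(order_date) AS first_order
    FROM orders GROUP BY customer_id
), retained AS (
    SELECT DISTINCT c.customer_id
    FROM cohort c
    JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.order_date >= date(c.first_order, '+1 month')
)
SELECT COUNT(*) * 1.0 / (SELECT COUNT(*) FROM cohort) FROM retained;
""").fetchone()[0]
print(rate)  # 1 of 3 customers retained
```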

43. Find duplicate records and count number of duplicates for each unique record

Explanation:

Groups by all columns that define a duplicate.

Sample SQL Query:

SELECT col1, col2, COUNT(*) AS duplicate_count FROM table_name GROUP BY col1, col2 HAVING COUNT(*) > 1;

44. Employees with >5 years tenure but never promoted

Explanation:

Finds employees with long tenure and no promotion record.

Sample SQL Query:

SELECT e.*
FROM employees e
LEFT JOIN promotions p ON e.employee_id = p.employee_id
WHERE DATEDIFF(YEAR, e.join_date, GETDATE()) > 5 AND p.employee_id IS NULL;

45. Write a SQL query to get the daily count of active users (logged in at least once)

Explanation:

Groups login events by date and counts distinct users.

Sample SQL Query:

SELECT CAST(login_timestamp AS DATE) AS login_date, COUNT(DISTINCT user_id) AS active_users
FROM user_logins
GROUP BY CAST(login_timestamp AS DATE)
ORDER BY login_date;

46. Find the 2nd highest transaction per user without using LIMIT or TOP

Explanation:

Uses DENSE_RANK() to rank transactions within each user and filters for the 2nd rank.

Sample SQL Query:

SELECT user_id, transaction_amount
FROM (
    SELECT user_id, transaction_amount,
           DENSE_RANK() OVER (PARTITION BY user_id ORDER BY transaction_amount DESC) AS rnk
    FROM transactions
) ranked_transactions
WHERE rnk = 2;

47. Identify data gaps in time-series event logs (e.g., missing hourly records)

Explanation:

Generates a series of expected hours and LEFT JOINs with actual data to find missing hours. This requires a way to generate a series of dates/times.

Sample SQL Query (conceptual, depends on DB's date generation):

WITH all_hours AS (
    SELECT GENERATE_SERIES('2024-01-01 00:00:00'::timestamp, '2024-01-01 23:00:00'::timestamp, '1 hour') AS hour_start
), actual_hours AS (
    SELECT DATE_TRUNC('hour', event_timestamp) AS hour_start, COUNT(*) AS event_count
    FROM event_logs
    WHERE event_timestamp >= '2024-01-01 00:00:00' AND event_timestamp < '2024-01-02 00:00:00'
    GROUP BY 1
)
SELECT ah.hour_start AS missing_hour
FROM all_hours ah
LEFT JOIN actual_hours act ON ah.hour_start = act.hour_start
WHERE act.event_count IS NULL
ORDER BY missing_hour;
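The same gap detection can be sketched with a recursive CTE in place of PostgreSQL's generate_series, which also makes it runnable on SQLite or MySQL 8 (hypothetical event data; the window is shortened to four hours to keep the example small):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE event_logs (event_timestamp TEXT);
INSERT INTO event_logs VALUES
  ('2024-01-01 00:15:00'), ('2024-01-01 01:30:00'), ('2024-01-01 03:05:00');
""")
missing = conn.execute("""
WITH RECURSIVE all_hours(hour_start) AS (
    -- Generate every expected hour in the window
    SELECT '2024-01-01 00:00:00'
    UNION ALL
    SELECT datetime(hour_start, '+1 hour') FROM all_hours
    WHERE hour_start < '2024-01-01 03:00:00'
),
actual_hours AS (
    -- Truncate each event to the hour it falls in
    SELECT strftime('%Y-%m-%d %H:00:00', event_timestamp) AS hour_start
    FROM event_logs GROUP BY 1
)
SELECT ah.hour_start
FROM all_hours ah
LEFT JOIN actual_hours act ON ah.hour_start = act.hour_start
WHERE act.hour_start IS NULL;
""").fetchall()
print(missing)  # the 02:00 hour has no events
```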

48. Fetch the first purchase date per user and calculate days since then

Explanation:

Uses a subquery to find the minimum purchase date for each user, then calculates the difference from the current date.

Sample SQL Query:

SELECT user_id, MIN(purchase_date) AS first_purchase_date,
       DATEDIFF(CURRENT_DATE, MIN(purchase_date)) AS days_since_first_purchase
FROM purchases
GROUP BY user_id;

49. Detect schema changes in SCD Type 2 tables using Delta Lake

Explanation:

Delta Lake automatically tracks schema changes in its transaction log. You can query the history to see schema evolution or use `mergeSchema` option during writes. Detecting *unintended* changes would involve comparing the current schema to a predefined expected schema.

Conceptual PySpark/Delta Lake approach:

# To see schema history
spark.sql("DESCRIBE HISTORY delta.`/path/to/delta/table`").show()

# To compare current schema with expected (conceptual)
# current_schema = spark.read.format("delta").load("/path/to/delta/table").schema
# expected_schema = StructType([...]) # Define your expected schema
# if current_schema != expected_schema:
#     print("Schema mismatch detected!")

50. Join product and transaction tables and filter out null foreign keys safely

Explanation:

Using an `INNER JOIN` automatically filters out rows where the join key is NULL in either table, ensuring only matched records are returned.

Sample SQL Query:

SELECT t.*, p.product_name
FROM transactions t
INNER JOIN products p ON t.product_id = p.product_id;
-- This implicitly handles null foreign keys by only returning matched rows.
-- To also keep transactions that have no matching product, use a LEFT JOIN instead:
-- SELECT t.*, p.product_name FROM transactions t LEFT JOIN products p ON t.product_id = p.product_id;

51. Get users who upgraded to premium within 7 days of signup

Explanation:

Joins users and premium_subscriptions, then filters based on the date difference.

Sample SQL Query:

SELECT u.user_id, u.signup_date, ps.upgrade_date
FROM users u
JOIN premium_subscriptions ps ON u.user_id = ps.user_id
WHERE ps.upgrade_date <= DATEADD(DAY, 7, u.signup_date);

52. Calculate cumulative distinct product purchases per customer

Explanation:

Most engines (PostgreSQL, SQL Server, MySQL) do not allow DISTINCT inside a window function, so `COUNT(DISTINCT ...) OVER` is rejected. Instead, flag the first occurrence of each (customer, product) pair with ROW_NUMBER(), then take a running SUM of the flags.

Sample SQL Query:

SELECT customer_id, purchase_date, product_id,
       SUM(is_first) OVER (PARTITION BY customer_id ORDER BY purchase_date
                           ROWS UNBOUNDED PRECEDING) AS cumulative_distinct_products
FROM (
    SELECT customer_id, purchase_date, product_id,
           CASE WHEN ROW_NUMBER() OVER (PARTITION BY customer_id, product_id
                                        ORDER BY purchase_date) = 1
                THEN 1 ELSE 0 END AS is_first
    FROM purchases
) t
ORDER BY customer_id, purchase_date;
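Because most engines reject DISTINCT inside a window function, a portable approach is to flag the first purchase of each (customer, product) pair and keep a running sum of those flags. A sqlite3 sketch on made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (customer_id INT, purchase_date TEXT, product_id INT);
INSERT INTO purchases VALUES
  (1, '2024-01-01', 10), (1, '2024-01-02', 10),  -- same product bought twice
  (1, '2024-01-03', 20), (2, '2024-01-01', 30);
""")
rows = conn.execute("""
SELECT customer_id, purchase_date,
       SUM(is_first) OVER (PARTITION BY customer_id ORDER BY purchase_date
                           ROWS UNBOUNDED PRECEDING) AS cumulative_distinct_products
FROM (
    SELECT customer_id, purchase_date,
           CASE WHEN ROW_NUMBER() OVER (PARTITION BY customer_id, product_id
                                        ORDER BY purchase_date) = 1
                THEN 1 ELSE 0 END AS is_first
    FROM purchases
) t
ORDER BY customer_id, purchase_date;
""").fetchall()
print(rows)
# Customer 1's count stays at 1 for the repeat purchase, then rises to 2
```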

53. Retrieve customers who spent above average in their region

Explanation:

Uses one CTE to total each customer's spend, a second CTE to average those totals per region, then filters customers above their regional average. (Averaging raw transaction amounts would wrongly compare a per-customer total against a per-transaction average.)

Sample SQL Query:

WITH CustomerSpend AS (
    SELECT c.customer_id, c.customer_name, c.region, SUM(t.amount) AS total_spend
    FROM customers c
    JOIN transactions t ON c.customer_id = t.customer_id
    GROUP BY c.customer_id, c.customer_name, c.region
),
RegionalAvgSpend AS (
    SELECT region, AVG(total_spend) AS avg_region_spend
    FROM CustomerSpend
    GROUP BY region
)
SELECT cs.customer_id, cs.customer_name, cs.total_spend, ras.avg_region_spend
FROM CustomerSpend cs
JOIN RegionalAvgSpend ras ON cs.region = ras.region
WHERE cs.total_spend > ras.avg_region_spend;

54. Find duplicate rows in an ingestion table (based on all columns)

Explanation:

Groups by all columns and counts occurrences. Filters for counts greater than 1.

Sample SQL Query:

SELECT col1, col2, col3, COUNT(*)
FROM ingestion_table
GROUP BY col1, col2, col3 -- Include all columns that define uniqueness
HAVING COUNT(*) > 1;

55. Compute daily revenue growth % using lag window function

Explanation:

Calculates daily revenue, then uses `LAG()` to get the previous day's revenue to compute growth percentage.

Sample SQL Query:

WITH DailyRevenue AS (
    SELECT CAST(sale_date AS DATE) AS sale_day, SUM(amount) AS daily_revenue
    FROM sales
    GROUP BY CAST(sale_date AS DATE)
)
SELECT sale_day, daily_revenue,
       LAG(daily_revenue, 1, 0) OVER (ORDER BY sale_day) AS previous_day_revenue,
       (daily_revenue - LAG(daily_revenue, 1, 0) OVER (ORDER BY sale_day)) * 100.0 / NULLIF(LAG(daily_revenue, 1, 0) OVER (ORDER BY sale_day), 0) AS growth_percent
FROM DailyRevenue
ORDER BY sale_day;
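The daily growth calculation can be verified with sqlite3 on made-up daily figures (a sketch; NULLIF guards against division by zero, and the first day has no prior row):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (sale_date TEXT, amount INT);
INSERT INTO sales VALUES
  ('2024-01-01', 100), ('2024-01-02', 150), ('2024-01-03', 120);
""")
rows = conn.execute("""
WITH DailyRevenue AS (
    SELECT sale_date AS sale_day, SUM(amount) AS daily_revenue
    FROM sales GROUP BY sale_date
)
SELECT sale_day, daily_revenue,
       (daily_revenue - LAG(daily_revenue) OVER (ORDER BY sale_day)) * 100.0
       / NULLIF(LAG(daily_revenue) OVER (ORDER BY sale_day), 0) AS growth_pct
FROM DailyRevenue
ORDER BY sale_day;
""").fetchall()
print(rows)
# Day 1 has no prior day (NULL), day 2 is +50%, day 3 is -20%
```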

56. Identify products with declining sales 3 months in a row

Explanation:

Calculates monthly sales, then uses `LAG()` to compare current month's sales with the previous two months.

Sample SQL Query:

WITH MonthlySales AS (
    SELECT product_id,
           DATE_TRUNC('month', sale_date) AS sales_month,
           SUM(amount) AS monthly_sales
    FROM sales
    GROUP BY product_id, DATE_TRUNC('month', sale_date)
), LaggedSales AS (
    SELECT product_id, sales_month, monthly_sales,
           LAG(monthly_sales, 1) OVER (PARTITION BY product_id ORDER BY sales_month) AS prev_month_sales_1,
           LAG(monthly_sales, 2) OVER (PARTITION BY product_id ORDER BY sales_month) AS prev_month_sales_2
    FROM MonthlySales
)
SELECT DISTINCT product_id
FROM LaggedSales
WHERE monthly_sales < prev_month_sales_1
  AND prev_month_sales_1 < prev_month_sales_2;
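A sqlite3 sketch of the three-months-declining check, on made-up data where product 1 declines every month and product 2 does not:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (product_id INT, sale_date TEXT, amount INT);
INSERT INTO sales VALUES
  (1, '2024-01-15', 300), (1, '2024-02-15', 200), (1, '2024-03-15', 100),
  (2, '2024-01-15', 100), (2, '2024-02-15', 200), (2, '2024-03-15', 150);
""")
declining = conn.execute("""
WITH MonthlySales AS (
    SELECT product_id, strftime('%Y-%m', sale_date) AS sales_month,
           SUM(amount) AS monthly_sales
    FROM sales GROUP BY product_id, sales_month
), LaggedSales AS (
    SELECT product_id, monthly_sales,
           LAG(monthly_sales, 1) OVER (PARTITION BY product_id ORDER BY sales_month) AS prev1,
           LAG(monthly_sales, 2) OVER (PARTITION BY product_id ORDER BY sales_month) AS prev2
    FROM MonthlySales
)
SELECT DISTINCT product_id FROM LaggedSales
WHERE monthly_sales < prev1 AND prev1 < prev2;
""").fetchall()
print(declining)  # only product 1 declines for three consecutive months
```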

57. Get users with at least 3 logins per week over last 2 months

Explanation:

Groups logins by user and week, then filters for users meeting the login count criteria over the specified period.

Sample SQL Query:

WITH WeeklyLogins AS (
    SELECT user_id,
           DATE_TRUNC('week', login_timestamp) AS login_week,
           COUNT(*) AS weekly_login_count
    FROM user_logins
    WHERE login_timestamp >= DATEADD(month, -2, CURRENT_DATE)
    GROUP BY user_id, DATE_TRUNC('week', login_timestamp)
)
SELECT user_id
FROM WeeklyLogins
WHERE weekly_login_count >= 3
GROUP BY user_id
HAVING COUNT(*) >= 8; -- roughly 8 full weeks in 2 months; every counted week already has >= 3 logins

58. Rank users by frequency of login in the current quarter

Explanation:

Counts logins per user in the current quarter, then ranks them based on this count.

Sample SQL Query:

SELECT user_id, login_count,
       RANK() OVER (ORDER BY login_count DESC) AS login_rank
FROM (
    SELECT user_id, COUNT(*) AS login_count
    FROM user_logins
    WHERE login_timestamp >= DATE_TRUNC('quarter', CURRENT_DATE)
      AND login_timestamp < DATE_TRUNC('quarter', CURRENT_DATE) + INTERVAL '3 month'
    GROUP BY user_id
) AS quarterly_logins
ORDER BY login_rank;

59. Fetch users who purchased same product multiple times in one day

Explanation:

Groups purchases by user, product, and date, then filters for groups with more than one purchase.

Sample SQL Query:

SELECT user_id, product_id, CAST(purchase_date AS DATE) AS purchase_day, COUNT(*) AS purchase_count
FROM purchases
GROUP BY user_id, product_id, CAST(purchase_date AS DATE)
HAVING COUNT(*) > 1;

60. Detect and delete late-arriving data for current month partitions

Explanation:

This is a conceptual approach for a partitioned table. You'd typically have a `load_timestamp` or similar column to identify when data *arrived* vs. its `event_timestamp`. Late data is identified by `event_timestamp` being in a 'closed' partition (e.g., previous month) but `load_timestamp` is current. Deletion then targets these specific records.

Conceptual SQL Query (assuming `event_date` is partition key, `load_date` is ingestion date):

-- Identify late-arriving data for the current month's partition
SELECT *
FROM your_partitioned_table
WHERE CAST(event_timestamp AS DATE) < DATE_TRUNC('month', CURRENT_DATE) -- Data belongs to previous month
  AND CAST(load_timestamp AS DATE) >= DATE_TRUNC('month', CURRENT_DATE); -- But loaded this month

-- To delete (use with extreme caution after careful testing):
-- DELETE FROM your_partitioned_table
-- WHERE CAST(event_timestamp AS DATE) < DATE_TRUNC('month', CURRENT_DATE)
--   AND CAST(load_timestamp AS DATE) >= DATE_TRUNC('month', CURRENT_DATE);

61. Get top 5 products by profit margin across all categories

Explanation:

Calculates profit margin for each product, then ranks and selects the top 5.

Sample SQL Query:

SELECT product_id, product_name, profit_margin
FROM (
    SELECT p.product_id, p.product_name,
           (SUM(s.sales_amount) - SUM(s.cost_amount)) * 1.0 / NULLIF(SUM(s.sales_amount), 0) AS profit_margin,
           ROW_NUMBER() OVER (ORDER BY (SUM(s.sales_amount) - SUM(s.cost_amount)) * 1.0 / NULLIF(SUM(s.sales_amount), 0) DESC) AS rnk
    FROM products p
    JOIN sales s ON p.product_id = s.product_id
    GROUP BY p.product_id, p.product_name
) ranked_products
WHERE rnk <= 5
ORDER BY profit_margin DESC;

62. Compare rolling 30-day revenue vs previous 30-day window

Explanation:

Uses a CTE to get daily revenue, then a second CTE for the rolling 30-day sum. Window functions cannot be nested, so the rolling sum is materialized first and LAG() then shifts it back 30 days.

Sample SQL Query:

WITH DailyRevenue AS (
    SELECT CAST(sale_date AS DATE) AS sale_day, SUM(amount) AS daily_revenue
    FROM sales
    GROUP BY CAST(sale_date AS DATE)
),
Rolling AS (
    SELECT sale_day,
           SUM(daily_revenue) OVER (ORDER BY sale_day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS current_30day_revenue
    FROM DailyRevenue
)
SELECT sale_day, current_30day_revenue,
       LAG(current_30day_revenue, 30) OVER (ORDER BY sale_day) AS previous_30day_revenue
FROM Rolling
ORDER BY sale_day;

63. Flag transactions happening outside business hours

Explanation:

Uses `CASE` statement to check if the transaction time falls outside a defined business hour range (e.g., 9 AM to 5 PM).

Sample SQL Query:

SELECT transaction_id, transaction_timestamp,
       CASE
           WHEN CAST(transaction_timestamp AS TIME) < '09:00:00' OR CAST(transaction_timestamp AS TIME) >= '17:00:00'
           THEN 'Outside Business Hours'
           ELSE 'During Business Hours'
       END AS business_hours_flag
FROM transactions;

64. Write an optimized SQL query using broadcast join hints for small lookup tables

Explanation:

A broadcast join hint tells the optimizer to replicate the smaller table to every node, avoiding a shuffle of the larger table. This is highly efficient for joining a large fact table with a small dimension or lookup table. Syntax varies by engine (for example, /*+ BROADCAST(d) */ in Spark SQL), and many optimizers broadcast small tables automatically.

Sample SQL Query (conceptual, syntax depends on specific SQL dialect/DB):

-- Example for SQL Server (join strategy forced via the OPTION clause)
SELECT f.fact_column, d.dimension_column
FROM FactTable f
INNER JOIN DimensionTable d ON f.dim_key = d.dim_key
OPTION (HASH JOIN);

-- Example for Spark SQL (using hint)
SELECT /*+ BROADCAST(d) */ f.fact_column, d.dimension_column
FROM
    FactTable f
INNER JOIN
    DimensionTable d ON f.dim_key = d.dim_key;

-- In most modern databases, the optimizer will automatically choose a broadcast hash join
-- if one table is small enough, so explicit hints are often not needed unless
-- you want to override default behavior or for specific edge cases.

ETL & Data Warehouse Concepts

This section covers ETL optimization, data warehouse design, and CDC implementation strategies.

65. Your source has 10 million records. How will you optimize the ETL job?

Solution / Answer Ideas:

  • Use partitioning (Hash or Range) to parallelize processing
  • Minimize in-memory transformations (avoid Lookups on large data)
  • Push logic to source DB using SQL override
  • Use bulk loading or batch commits
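The bulk-loading/batch-commit point can be sketched in plain Python, using an in-memory SQLite table as a stand-in for the target system (the table name, columns, and batch size are all illustrative, not a specific tool's API):

```python
import sqlite3

def load_in_batches(rows, batch_size=10_000):
    """Load rows with batched commits instead of one commit per record."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target (id INTEGER, amount REAL)")
    # executemany + periodic commit avoids per-row round trips and commit overhead
    for start in range(0, len(rows), batch_size):
        conn.executemany("INSERT INTO target VALUES (?, ?)",
                         rows[start:start + batch_size])
        conn.commit()
    return conn

conn = load_in_batches([(i, i * 1.5) for i in range(25_000)], batch_size=10_000)
print(conn.execute("SELECT COUNT(*) FROM target").fetchone()[0])  # 25000
```

The same pattern scales to partitioned parallel loads: run one such batch loader per partition.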

66. How do you handle a scenario where some records are rejected during transformation?

Solution / Answer Ideas:

  • Redirect bad records to a reject file/table
  • Capture error message, source key, and timestamp for debugging
  • Use error handling stages (e.g., Reject links in DataStage, Error log in Informatica)
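A minimal Python sketch of the reject-routing pattern described above, with a toy validation rule and invented field names:

```python
from datetime import datetime, timezone

def transform(record):
    # toy transformation: amount must be a non-negative number
    if record.get("amount") is None or record["amount"] < 0:
        raise ValueError("invalid amount")
    return {**record, "amount": round(record["amount"], 2)}

def run_with_rejects(records):
    loaded, rejects = [], []
    for rec in records:
        try:
            loaded.append(transform(rec))
        except Exception as exc:
            # capture error message, source key, and timestamp for debugging
            rejects.append({"source_key": rec.get("id"),
                            "error": str(exc),
                            "rejected_at": datetime.now(timezone.utc).isoformat()})
    return loaded, rejects

good, bad = run_with_rejects([{"id": 1, "amount": 10.0},
                              {"id": 2, "amount": -5},
                              {"id": 3, "amount": None}])
print(len(good), len(bad))  # 1 2
```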

67. You need to pass different file paths for DEV, QA, and PROD. How would you do this?

Solution / Answer Ideas:

  • Use parameter files or environment variables
  • DataStage: ParamSet or DSParam file
  • Informatica: Parameter file with $$SourceFilePath

68. How do you implement incremental load (CDC)?

Solution / Answer Ideas:

  • Use Last Updated Timestamp or Surrogate Key
  • Store last load timestamp in a control table
  • Filter source using WHERE last_update > :last_loaded_time
  • Informatica: CDC mappings or Change Data Capture tools
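The control-table watermark approach can be sketched with SQLite standing in for both the source and the control table (the job name, table names, and columns are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (id INTEGER, last_update TEXT);
    CREATE TABLE control (job TEXT PRIMARY KEY, last_loaded_time TEXT);
    INSERT INTO source VALUES (1, '2024-01-01'), (2, '2024-02-01'), (3, '2024-03-01');
    INSERT INTO control VALUES ('daily_load', '2024-01-15');
""")

def incremental_extract(conn, job):
    # read the high-watermark from the control table ...
    (wm,) = conn.execute(
        "SELECT last_loaded_time FROM control WHERE job = ?", (job,)).fetchone()
    # ... pull only records changed since then ...
    rows = conn.execute(
        "SELECT id, last_update FROM source WHERE last_update > ? ORDER BY last_update",
        (wm,)).fetchall()
    # ... and advance the watermark for the next run
    if rows:
        conn.execute("UPDATE control SET last_loaded_time = ? WHERE job = ?",
                     (max(r[1] for r in rows), job))
    return rows

print(incremental_extract(conn, "daily_load"))  # [(2, '2024-02-01'), (3, '2024-03-01')]
print(incremental_extract(conn, "daily_load"))  # [] -- nothing new since last run
```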

69. How would you load the most recent record per customer from a transaction table?

Solution / Answer Ideas:

  • Sort by Customer ID and Date in descending order
  • Use row_number() or stage logic to pick only first record
  • Informatica: Use Sorter + Expression to flag first row
  • DataStage: Use Remove Duplicates or Transformer with stage variables
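A plain-Python sketch of the "latest record per key" logic (field names are illustrative); the same idea maps to row_number() in SQL or Sorter + Expression in an ETL tool:

```python
transactions = [
    {"customer_id": "C1", "txn_date": "2024-01-05", "amount": 100},
    {"customer_id": "C2", "txn_date": "2024-01-07", "amount": 250},
    {"customer_id": "C1", "txn_date": "2024-02-10", "amount": 300},
]

def latest_per_customer(rows):
    # sort ascending by date, then keep the last row seen per customer
    latest = {}
    for row in sorted(rows, key=lambda r: r["txn_date"]):
        latest[row["customer_id"]] = row
    return latest

print(latest_per_customer(transactions)["C1"]["amount"])  # 300
```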

70. How do you implement Slowly Changing Dimensions (SCD) in a data warehouse?

Explanation:

SCD Type 1: Overwrite the old value (no history kept).

SCD Type 2: Add a new row per change, with versioning/effective-date timestamps and a current-row flag.

SCD Type 3: Add a column holding the previous value. Use ETL tools or SQL MERGE/upsert logic to manage SCDs.
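As a rough illustration of Type 2 logic (a sketch, not any particular tool's implementation), here is a Python function that expires the current row and appends a new version; the field names are invented:

```python
from datetime import date

def scd2_upsert(dimension, key, new_attrs, as_of):
    """Close the current row for `key` (if changed) and append a new version."""
    current = next((r for r in dimension
                    if r["key"] == key and r["is_current"]), None)
    if current and current["attrs"] == new_attrs:
        return dimension  # no change, nothing to do
    if current:
        current["is_current"] = False
        current["end_date"] = as_of          # expire the old version
    dimension.append({"key": key, "attrs": new_attrs,
                      "start_date": as_of, "end_date": None,
                      "is_current": True})   # insert the new version
    return dimension

dim = []
scd2_upsert(dim, "CUST-1", {"city": "Mumbai"}, date(2024, 1, 1))
scd2_upsert(dim, "CUST-1", {"city": "Delhi"}, date(2024, 6, 1))
print(len(dim), dim[-1]["attrs"]["city"])  # 2 Delhi
```

In a warehouse, the same logic is typically expressed as a MERGE statement keyed on the natural key plus the current-row flag.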

71. Explain the concept of star schema and snowflake schema in data modeling

Explanation:

Star schema: Central fact table linked to denormalized dimension tables; simple, fast queries.

Snowflake schema: Dimensions are normalized into multiple related tables; reduces redundancy but more complex joins.

72. How would you design a fact table for an e-commerce platform?

Explanation:

Include transaction-level facts (e.g., sales amount, quantity), foreign keys to dimensions (date, product, customer), and measures (discount, tax). Ensure granularity matches business needs (e.g., order line item).

Spark & Big Data Concepts

This section covers Apache Spark optimization, distributed computing, and big data processing concepts.

73. Explain the difference between repartition and coalesce in Spark

Explanation:

Repartition: Increases/decreases partitions, shuffles data.

Coalesce: Decreases partitions, minimizes shuffle.

74. Explain the differences between caching and persisting in Spark

Explanation:

Cache: Shorthand for persist() with the default storage level (in-memory).

Persist: Stores data in memory and/or on disk, with configurable storage levels (e.g., MEMORY_ONLY, MEMORY_AND_DISK).

75. What is a distributed table, and what are its types?

Explanation:

Table split across nodes for parallel processing.

Types: Sharded, Replicated, Partitioned.

76. Explain Spark's optimization techniques

Explanation:

Catalyst optimizer, Tungsten execution, predicate pushdown, broadcast joins, partition pruning, caching.

77. Explain the difference between wide and narrow transformations in Spark

Explanation:

Narrow: Each output partition depends on a single input partition; no shuffle needed (e.g., map, filter).

Wide: Output partitions depend on data from many input partitions, requiring a shuffle (e.g., groupBy, join).

78. Define Autoscaling and Auto-Termination

Explanation:

Autoscaling: Dynamically adjusts cluster resources.

Auto-Termination: Shuts down cluster after inactivity.

79. Explain the differences between RDD and DataFrame

Explanation:

RDD: Low-level, type-safe, no schema, less optimized.

DataFrame: High-level, schema, optimized via Catalyst.

80. Explain the Parquet file format and its advantages

Explanation:

Columnar storage, compression, schema evolution, efficient for analytics, supports predicate pushdown.

81. Define Delta Lake and explain its versioning features

Explanation:

Delta Lake: ACID transactions on data lakes, supports time travel/versioning, schema enforcement, and rollback.

82. How do you perform schema evolution in Delta Lake?

Explanation:

Use Delta Lake's mergeSchema option to allow new columns. E.g., df.write.option('mergeSchema', 'true').format('delta').mode('append').save(path). Delta automatically tracks schema changes.

83. How would you design a data pipeline to handle both batch and streaming data?

Explanation:

Use a unified architecture: ingest batch data via scheduled jobs and streaming data via Spark Structured Streaming. Store both in a common data store (e.g., Delta Lake) and process with the same transformation logic.

Python & PySpark Coding Challenges

This section covers practical Python and PySpark coding scenarios commonly asked in interviews.

84. Write a Python script to read a CSV file and load it into a DataFrame

Python Solution:

import pandas as pd
df = pd.read_csv('filename.csv')
print(df.head())

85. How do you handle exceptions in Python using try-except blocks?

Python Solution:

try:
    # risky code
    result = 10 / 0
except ZeroDivisionError as e:
    print(f'Error: {e}')
finally:
    print('Cleanup or final steps')

86. In PySpark, how would you perform a join operation between two large DataFrames efficiently?

PySpark Solution / Explanation:

Use broadcast() for small DataFrames to avoid shuffles:

from pyspark.sql.functions import broadcast
df_large.join(broadcast(df_small), 'key')
# For large-large joins, ensure both are partitioned on the join key.

87. Write a PySpark code to find the top 3 customers with the highest revenue per region

PySpark Solution:

from pyspark.sql.window import Window
import pyspark.sql.functions as F
window = Window.partitionBy('region').orderBy(F.desc('revenue'))
df.withColumn('rank', F.row_number().over(window)) \
  .filter('rank <= 3') \
  .select('customer_id', 'region', 'revenue')

88. What is the difference between partitioning and bucketing in PySpark?

Explanation:

Partitioning: Splits data into directories based on column values, improving query performance for partitioned columns.

Bucketing: Divides data into fixed buckets using a hash function, enabling efficient joins and sampling even if the bucket column isn't in the filter.

89. Write a PySpark code to process streaming data from Event Hub in Databricks

PySpark Solution:

from pyspark.sql.types import StringType
event_hub_conf = { 'eventhubs.connectionString': '' }
df = (spark.readStream.format('eventhubs')
      .options(**event_hub_conf)
      .load())
df.writeStream.format('console').start().awaitTermination()

90. Write a PySpark script to load data from ADLS into a Delta table

PySpark Solution:

df = spark.read.format('csv').option('header', 'true').load('abfss://container@account.dfs.core.windows.net/path')
df.write.format('delta').mode('overwrite').save('/mnt/delta/table')

91. Write a Python script to validate data quality and detect anomalies

Python Solution:

import pandas as pd
df = pd.read_csv('data.csv')
print(df.isnull().sum())  # missing values
print(df.describe())  # summary stats
anomalies = df[df['value'] > df['value'].mean() + 3*df['value'].std()]
print(anomalies)

92. Write a PySpark code to perform window functions for ranking sales data

PySpark Solution:

from pyspark.sql.window import Window
import pyspark.sql.functions as F
window = Window.partitionBy('region').orderBy(F.desc('sales'))
df.withColumn('rank', F.rank().over(window)).show()

93. Explain the differences between lists, tuples, sets, and dictionaries in Python

Detailed Interview Response:

  • Lists: Ordered, mutable collections (e.g., [1,2,3]), used for sequences.
  • Tuples: Ordered, immutable (e.g., (1,2,3)), used for fixed data.
  • Sets: Unordered, unique items (e.g., {1,2,3}), used for membership tests.
  • Dictionaries: Key-value pairs (e.g., {'a':1}), used for fast lookups and mapping relationships.
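A quick demonstration of all four structures:

```python
nums_list = [1, 2, 2, 3]          # ordered, mutable, allows duplicates
nums_list.append(4)

point = (10, 20)                  # ordered, immutable -- usable as a dict key
distances = {point: 22.36}

unique = set(nums_list)           # unordered, duplicates removed
print(2 in unique)                # True -- fast O(1) membership test

ages = {"alice": 30, "bob": 25}   # key-value mapping, fast lookups
print(ages["alice"])              # 30
```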

94. How would you use pandas to merge two datasets and calculate total sales for products with valid promotions?

Python Solution:

import pandas as pd
sales = pd.read_csv('sales.csv')
promos = pd.read_csv('promotions.csv')
merged = pd.merge(sales, promos, on='product_id')
valid = merged[merged['promotion_active'] == True]
total_sales = valid.groupby('product_id')['sales_amount'].sum()
print(total_sales)

Python Interview Questions

Core Python concepts and interview questions covering data structures, memory management, and OOP principles.

95. What are the differences between a list and a tuple in Python?

Sample Response:

Lists are mutable, meaning their elements can be changed, while tuples are immutable. Lists use square brackets [ ], tuples use parentheses ( ). Tuples are generally faster and can be used as dictionary keys if they contain only immutable elements.

96. How does Python handle memory management and garbage collection?

Sample Response:

Python uses automatic memory management with a built-in garbage collector that reclaims memory by reference counting and cyclic garbage collection. The gc module can be used to interact with the garbage collector.

97. Can you explain the concept of list comprehensions? Provide an example

Sample Response:

List comprehensions provide a concise way to create lists. Example: [x*x for x in range(5)] creates [0, 1, 4, 9, 16].

98. What is the difference between deep copy and shallow copy in Python?

Sample Response:

A shallow copy creates a new object but does not create copies of nested objects; changes to nested objects affect both copies. A deep copy creates a new object and recursively copies all nested objects, so changes do not affect the original.
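A short demonstration using the copy module:

```python
import copy

original = [[1, 2], [3, 4]]

shallow = copy.copy(original)      # new outer list, same inner lists
deep = copy.deepcopy(original)     # inner lists recursively copied too

original[0].append(99)
print(shallow[0])  # [1, 2, 99] -- the shared inner list was mutated
print(deep[0])     # [1, 2]     -- unaffected
```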

99. How do you handle exceptions in Python? Can you give an example of custom exception handling?

Sample Response:

Use try, except blocks. Custom exceptions are created by subclassing Exception. Example: class MyError(Exception): pass and then try: ... except MyError: ...
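A slightly fuller sketch with a custom exception (the exception name and validation rule are illustrative):

```python
class DataQualityError(Exception):
    """Raised when a record fails validation (illustrative name)."""

def validate(record):
    if record.get("amount", 0) < 0:
        raise DataQualityError(f"negative amount for id={record.get('id')}")
    return record

try:
    validate({"id": 7, "amount": -1})
except DataQualityError as exc:
    message = str(exc)

print(message)  # negative amount for id=7
```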

100. What are *args and **kwargs in Python functions? When would you use them?

Sample Response:

*args allows a function to accept any number of positional arguments, while **kwargs allows for any number of keyword arguments. Useful when you don't know beforehand how many arguments will be passed.
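A brief example of both, including the mirror-image use of * and ** to unpack arguments at the call site:

```python
def log_call(*args, **kwargs):
    # args is a tuple of positional arguments, kwargs a dict of keyword ones
    return f"args={args}, kwargs={kwargs}"

print(log_call(1, 2, retries=3))  # args=(1, 2), kwargs={'retries': 3}

def add(a, b, c):
    return a + b + c

params = (1, 2)
options = {"c": 3}
print(add(*params, **options))    # 6 -- unpacking works in calls too
```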

101. Explain the difference between a class method, static method, and instance method

Sample Response:

Instance methods operate on the object instance and can access/modify object state. Class methods use @classmethod and take cls as the first argument; they can access/modify class state. Static methods use @staticmethod and don't access class or instance state.
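A compact illustration of all three method types:

```python
class Counter:
    count = 0  # class-level state

    def __init__(self, start):
        self.value = start          # instance state

    def increment(self):            # instance method: uses self
        self.value += 1
        return self.value

    @classmethod
    def from_zero(cls):             # class method: alternative constructor
        cls.count += 1
        return cls(0)

    @staticmethod
    def is_valid(n):                # static method: no self/cls access
        return isinstance(n, int)

c = Counter.from_zero()
print(c.increment(), Counter.count, Counter.is_valid(5))  # 1 1 True
```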

102. What is inheritance in Python? Can you provide a simple example?

Sample Response:

Inheritance allows a class to inherit attributes and methods from another class. Example: class Animal: ... then class Dog(Animal): ...
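For example:

```python
class Animal:
    def __init__(self, name):
        self.name = name

    def speak(self):
        return f"{self.name} makes a sound"

class Dog(Animal):               # Dog inherits __init__ from Animal
    def speak(self):             # and overrides speak()
        return f"{self.name} barks"

print(Dog("Rex").speak())              # Rex barks
print(isinstance(Dog("Rex"), Animal))  # True
```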

103. How does Python's method resolution order (MRO) work in multiple inheritance?

Sample Response:

MRO determines the order in which base classes are searched when executing a method. Python uses the C3 linearization algorithm, accessible via ClassName.__mro__.
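A small diamond-inheritance example showing the linearized order:

```python
class A:
    def who(self):
        return "A"

class B(A):
    def who(self):
        return "B"

class C(A):
    def who(self):
        return "C"

class D(B, C):   # diamond inheritance: C3 linearization gives D -> B -> C -> A
    pass

print([cls.__name__ for cls in D.__mro__])  # ['D', 'B', 'C', 'A', 'object']
print(D().who())                            # B -- first match in MRO order
```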

104. How do you read and write data from/to a CSV file using Python?

Sample Response:

Use the csv module or Pandas. Example: import csv; with open('file.csv') as f: reader = csv.reader(f) for reading. With Pandas: pd.read_csv('file.csv').

105. What are some common methods provided by the Pandas DataFrame object?

Sample Response:

Common methods include head(), tail(), info(), describe(), groupby(), merge(), drop(), fillna(), and apply().

106. How would you handle missing data in a Pandas DataFrame?

Sample Response:

Use methods like dropna() to remove missing values or fillna() to replace them with a specific value or method (e.g., mean, median).

107. Can you explain how to merge or join two DataFrames in Pandas?

Sample Response:

Use the merge() function for SQL-style joins or concat() for stacking DataFrames vertically or horizontally. Example: pd.merge(df1, df2, on='key').

108. How do you schedule a Python script to run automatically (e.g., daily)?

Sample Response:

On Linux, use cron jobs; on Windows, use Task Scheduler. Alternatively, use Python libraries like schedule or APScheduler for in-app scheduling.

109. What is the use of the __init__.py file in a Python package?

Sample Response:

__init__.py marks a directory as a Python package and can be used to execute package initialization code or set the __all__ variable.

110. How can you parse JSON data in Python?

Sample Response:

Use the json module: import json; data = json.loads(json_string) to parse, and json.dumps(obj) to serialize.

111. What is a generator in Python? How is it different from a normal function?

Sample Response:

Generators use yield to return values one at a time, maintaining state between calls, which makes them memory efficient for large datasets. Normal functions return all values at once with return.
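For example, a generator that yields rows in fixed-size batches, a common pattern when streaming large extracts:

```python
def read_batches(rows, batch_size):
    """Yield rows in fixed-size batches without materializing them all."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

gen = read_batches(range(7), 3)
print(next(gen))   # [0, 1, 2] -- state is kept between calls
print(list(gen))   # [[3, 4, 5], [6]]
```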

112. How do you use decorators in Python? Can you provide an example?

Sample Response:

Decorators are functions that modify the behavior of other functions. Example: @my_decorator above a function definition.
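A concrete example (a timing decorator; the helper name is illustrative):

```python
import functools
import time

def timed(func):
    """Decorator that records how long the wrapped function takes."""
    @functools.wraps(func)  # preserves __name__, __doc__, etc.
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.last_elapsed = time.perf_counter() - start
        return result
    return wrapper

@timed
def slow_square(n):
    time.sleep(0.01)
    return n * n

print(slow_square(4))                # 16
print(slow_square.last_elapsed > 0)  # True
print(slow_square.__name__)          # slow_square
```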

113. Explain the Global Interpreter Lock (GIL) in Python and its implications for multithreading

Sample Response:

The GIL allows only one thread to execute Python bytecode at a time, which can limit CPU-bound multithreaded programs. For I/O-bound tasks, threading is still useful. For CPU-bound tasks, multiprocessing is recommended.

114. How would you optimize a Python script that is running slowly due to processing a large dataset?

Sample Response:

Use efficient data structures, vectorized operations with NumPy/Pandas, avoid loops where possible, use generators, and profile code to identify bottlenecks.

115. How do you add an element to a Python list?

Exact Answer:

Use the append() method. Example: my_list.append(value)

116. How do you add an element to a Python set?

Exact Answer:

Use the add() method. Example: my_set.add(value)

117. What is the difference between a Python list and a tuple?

Exact Answer:

Lists are mutable (can be changed), tuples are immutable (cannot be changed). Lists use [ ], tuples use ( ).

118. How do you create a virtual environment in Python?

Exact Answer:

Run python -m venv env and activate it with source env/bin/activate (Linux/Mac) or env\Scripts\activate (Windows).

119. What is the difference between REST and SOAP APIs?

Exact Answer:

REST is an architectural style using HTTP and is stateless, typically uses JSON, and is simpler. SOAP is a protocol, uses XML, is more rigid, and supports advanced features like security and transactions.

120. What is the purpose of the __init__ method in Python classes?

Exact Answer:

__init__ is the constructor method called when a new object is created from a class. It initializes the object's attributes.

Advanced SQL Scenario Queries

Complex multi-table scenarios and advanced SQL techniques.

168. Scenario: You have three tables: "Employees," "Departments," and "Salaries." The "Employees" table has the following columns: EmployeeID, EmployeeName, DepartmentID. The "Departments" table has the following columns: DepartmentID, DepartmentName. The "Salaries" table has the following columns: EmployeeID, Salary, EffectiveDate. Write a SQL query to retrieve the employee who has had the highest salary increase within the last year, along with their name, department, and the percentage increase.

Explanation / Use Case:

Uses LAG to find salary changes, calculates percentage increase, returns employee with highest increase.

Sample SQL Query:

WITH SalaryChanges AS (
    SELECT s.EmployeeID, s.Salary, s.EffectiveDate,
           LAG(s.Salary) OVER (PARTITION BY s.EmployeeID ORDER BY s.EffectiveDate) AS PrevSalary
    FROM Salaries s
    WHERE s.EffectiveDate >= DATEADD(YEAR, -1, GETDATE())
),
Increases AS (
    SELECT EmployeeID, (Salary - PrevSalary) AS SalaryIncrease, PrevSalary, Salary
    FROM SalaryChanges
    WHERE PrevSalary IS NOT NULL AND PrevSalary > 0
)
SELECT TOP 1 e.EmployeeName, d.DepartmentName,
       SalaryIncrease * 100.0 / PrevSalary AS PercentageIncrease
FROM Increases i
JOIN Employees e ON i.EmployeeID = e.EmployeeID
JOIN Departments d ON e.DepartmentID = d.DepartmentID
ORDER BY PercentageIncrease DESC;

169. Scenario: You have a table named "Logs" with the following columns: LogID, LogTime, UserID. Write a SQL query to find the top 5 users who have logged in the most consecutive days, along with the number of consecutive days.

Explanation / Use Case:

Finds login streaks using date arithmetic, counts streaks, returns top 5 users.

Sample SQL Query:

WITH UserLogs AS (
    SELECT UserID, CAST(LogTime AS DATE) AS LogDate
    FROM Logs
    GROUP BY UserID, CAST(LogTime AS DATE)
),
Streaks AS (
    SELECT UserID, LogDate,
           -- consecutive dates share the same (date - row_number) group key
           DATEADD(DAY, -ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY LogDate), LogDate) AS grp
    FROM UserLogs
),
ConsecutiveCounts AS (
    SELECT UserID, COUNT(*) AS ConsecutiveDays
    FROM Streaks
    GROUP BY UserID, grp
)
SELECT TOP 5 UserID, MAX(ConsecutiveDays) AS MaxConsecutiveDays
FROM ConsecutiveCounts
GROUP BY UserID
ORDER BY MaxConsecutiveDays DESC;

170. Scenario: Calculate the average transaction amount for each user, including users who have no transactions, and return the result as zero for those users.

Explanation / Use Case:

LEFT JOIN ensures all users are included; COALESCE returns zero for users with no transactions.

Sample SQL Query:

SELECT u.UserID, COALESCE(AVG(t.Amount), 0) AS AvgTransaction
FROM Users u
LEFT JOIN Transactions t ON u.UserID = t.UserID
GROUP BY u.UserID;

171. Scenario: You have two tables: "Customers" and "Purchases." The "Customers" table has the following columns: CustomerID, CustomerName. The "Purchases" table has the following columns: PurchaseID, PurchaseDate, CustomerID, ProductID. Write a SQL query to find the customers who have purchased all products.

Explanation / Use Case:

Compares count of products purchased by customer to total products; returns those who bought all.

Sample SQL Query:

SELECT c.CustomerID, c.CustomerName
FROM Customers c
JOIN Purchases p ON c.CustomerID = p.CustomerID
GROUP BY c.CustomerID, c.CustomerName
HAVING COUNT(DISTINCT p.ProductID) = (SELECT COUNT(DISTINCT ProductID) FROM Purchases);

SQL Date/Time Calculations

Essential date and time manipulation queries for business logic.

172. Number of days between two dates, excluding weekends

Explanation / Notes:

Generates each date between the two dates using a numbers table and counts only the weekdays. Replace @start_date and @end_date with your date columns or variables.

Corrected SQL Code:

SELECT SUM(CASE WHEN DATENAME(WEEKDAY, d) NOT IN ('Saturday', 'Sunday') THEN 1 ELSE 0 END) AS weekday_count
FROM (
    SELECT DATEADD(DAY, n, @start_date) AS d
    FROM (SELECT TOP (DATEDIFF(DAY, @start_date, @end_date) + 1)
                 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
          FROM master..spt_values) t
) days;

173. First day of the previous month

Explanation / Notes:

Finds the first day of the previous month by truncating to month and subtracting one.

Corrected SQL Code:

SELECT DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()) - 1, 0) AS first_day_prev_month;

174. Last day of the next month

Explanation / Notes:

Moves to the first day of the month after next, then subtracts one day.

Corrected SQL Code:

SELECT DATEADD(DAY, -1, DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()) + 2, 0)) AS last_day_next_month;

175. Hours between two timestamps, considering only business hours (9am-5pm)

Explanation / Notes:

Adjusts timestamps to business hours if outside 9am-5pm. Assumes both timestamps are on the same day. For multi-day spans, logic must be extended.

Corrected SQL Code:

SELECT
    CASE WHEN CAST(t1.[timestamp] AS TIME) < '09:00' THEN '09:00'
         ELSE CAST(t1.[timestamp] AS TIME) END AS clamped_start,
    CASE WHEN CAST(t2.[timestamp] AS TIME) > '17:00' THEN '17:00'
         ELSE CAST(t2.[timestamp] AS TIME) END AS clamped_end,
    DATEDIFF(HOUR,
        CASE WHEN CAST(t1.[timestamp] AS TIME) < '09:00'
             THEN DATEADD(HOUR, 9, CAST(CAST(t1.[timestamp] AS DATE) AS DATETIME))
             ELSE t1.[timestamp] END,
        CASE WHEN CAST(t2.[timestamp] AS TIME) > '17:00'
             THEN DATEADD(HOUR, 17, CAST(CAST(t2.[timestamp] AS DATE) AS DATETIME))
             ELSE t2.[timestamp] END
    ) AS business_hours_diff
FROM table1 t1
JOIN table2 t2 ON (join_condition);

176. Date of next occurrence of a specific weekday (e.g., next Wednesday)

Explanation / Notes:

Calculates days to add to reach the next desired weekday. Adjust @target_weekday as needed (1=Sunday, 2=Monday, ..., 7=Saturday).

Corrected SQL Code:

DECLARE @target_weekday INT = 4; -- 1=Sunday, 4=Wednesday (with the default DATEFIRST 7)
SELECT DATEADD(DAY, ((@target_weekday - DATEPART(WEEKDAY, GETDATE()) + 6) % 7) + 1, GETDATE()) AS next_weekday;

Advanced Python & Ecosystem Concepts

Advanced topics covering frameworks, APIs, data processing, database optimization, and DevOps practices.

121. Python and Frameworks: Can you explain the differences between Django, Flask, and FastAPI? When would you choose one over the others?

Expected Answer:

Django is a full-stack web framework offering a lot of built-in features, suitable for larger applications requiring an ORM and admin interface. Flask is a micro-framework, ideal for smaller applications or services where flexibility is needed. FastAPI is designed for building APIs quickly with automatic generation of OpenAPI documentation, and is particularly useful for asynchronous applications.

122. RESTful API Development: Describe the process you follow to design and implement a RESTful API. What are some best practices you adhere to?

Expected Answer:

The process involves defining the API endpoints and their methods (GET, POST, PUT, DELETE), ensuring data validation and error handling, and implementing authentication and authorization. Best practices include using consistent naming conventions, versioning APIs, and providing comprehensive documentation.

123. Data Processing with Pandas and NumPy: How have you used Pandas and NumPy for data transformation in your projects? Can you provide a specific example?

Expected Answer:

Pandas is used for data manipulation and analysis, providing data structures like DataFrames for handling complex datasets. NumPy is used for numerical computations, offering efficient array operations. An example might be cleaning and transforming data from CSV files using Pandas, and performing mathematical operations using NumPy.

124. Database Optimization: What strategies do you use to optimize database queries and interactions in MySQL or PostgreSQL?

Expected Answer:

Strategies include indexing columns that are frequently queried, optimizing query structure, using joins efficiently, and caching results where possible. Regularly analyzing query performance and adjusting based on database usage patterns is also crucial.

125. Deployment and DevOps: How do you approach deploying applications on Linux servers using Docker and Jenkins?

Expected Answer:

Deployment involves containerizing applications with Docker to ensure consistency across environments, and using Jenkins for continuous integration and deployment. This process typically includes setting up automated build and test pipelines, and managing deployments on Linux servers.

126. OCR and Document Processing: Can you explain how you integrated Tesseract OCR and OpenCV in your OCR-based document processing project? What challenges did you face?

Expected Answer:

Tesseract OCR is used for text extraction from images and PDFs, while OpenCV is used for pre-processing images to improve OCR accuracy. Challenges may include handling varied document formats and ensuring high accuracy in text extraction, which can be addressed through image enhancement techniques and fine-tuning OCR settings.

Azure Data Platform Questions

Comprehensive questions covering Azure Data Factory, Databricks, Synapse Analytics, and Azure ecosystem.

127. How do you build an ETL pipeline using Azure Data Factory?

Explanation:

Define linked services (data sources), create datasets, build pipelines with activities (Copy, Data Flow, etc.), use triggers for scheduling, and monitor pipeline runs. Parameterize for reusability and handle errors with activities like Web or Stored Procedure.

128. What are the different types of triggers in ADF and when to use them?

Explanation:

Schedule Trigger: Runs at specific times.

Tumbling Window: Fixed-size, non-overlapping intervals for batch processing.

Event-based: Responds to events like file arrival. Use based on data arrival pattern and business requirements.

129. Explain the architecture of Azure Databricks and its integration with Delta Lake

Explanation:

Azure Databricks is a managed Spark platform with collaborative notebooks. It integrates with Delta Lake for ACID transactions, schema enforcement, and time travel on data lakes. Data is stored in Azure Data Lake Storage and processed via Databricks clusters.

130. How do you optimize query performance in Azure Synapse Analytics?

Explanation:

Use result set caching, materialized views, partitioned tables, proper distribution (hash/round robin/replicate), minimize data movement, and leverage indexes. Monitor with Query Performance Insight.

131. How would you design a data warehouse for a retail business using Synapse?

Explanation:

Use star/snowflake schema for sales, inventory, customer, and product data. Store raw data in Data Lake, use Synapse pipelines for ETL, and create dedicated SQL pools for analytics. Partition and distribute tables for performance.

132. What are the best practices for securing data in Azure Data Lake Storage?

Explanation:

Use RBAC and ACLs for granular access, enable encryption at rest and in transit, use private endpoints, monitor with Azure Monitor, and enable firewall rules.

133. How do you manage access control and secrets using Azure Key Vault?

Explanation:

Store secrets, keys, and certificates in Key Vault. Grant access via Azure AD, use managed identities for services, and reference secrets directly in ADF, Databricks, or Synapse configurations.

134. How do you implement data lineage and governance in Microsoft Purview?

Explanation:

Register data sources, scan assets, and use Purview to track lineage from ingestion to consumption. Define business glossary, classify sensitive data, and monitor data access and movement.

135. Build a real-time analytics pipeline using Event Hub, Stream Analytics, and Synapse

Explanation:

Ingest streaming data with Event Hub, process/aggregate in Azure Stream Analytics, and output to Synapse Analytics for real-time dashboards and reporting.

136. How would you handle late-arriving data in a batch ETL pipeline?

Explanation:

Use watermarking and windowing to process late data, design ETL to allow reprocessing, and maintain audit logs for late arrivals. In ADF, use tumbling window triggers with late arrival tolerance.

137. How do you implement incremental data loading in ADF pipelines?

Explanation:

Use watermark columns (e.g., last modified date), store high-watermark values, and filter source data for new/changed records. Use ADF's built-in incremental copy or custom logic with parameters.

138. How do you optimize storage and query performance in a Synapse dedicated pool?

Explanation:

Choose appropriate distribution (hash/round robin/replicate), partition large tables, use materialized views, minimize data movement, compress data, and regularly update statistics.

139. Build an end-to-end data pipeline that ingests data from multiple sources, transforms it in Databricks, and loads it into Synapse for reporting

Explanation:

Ingest data from sources (ADLS, SQL, APIs) using ADF, process and transform in Databricks (cleaning, enrichment, aggregation), write results to Synapse dedicated pool, and build Power BI dashboards for reporting. Use orchestration and monitoring for reliability.

Power BI Interview Questions

Key Power BI concepts covering data modes, security, and visualization techniques.

140. Explain the difference between Import and Direct Query modes. Which would you choose for large datasets?

Detailed Interview Response:

Import: Loads data into Power BI for fast, in-memory analysis but is static and needs refreshing.

Direct Query: Leaves data in the source, enabling real-time queries but may be slower and limited by source performance.

For very large datasets, use Direct Query to avoid memory constraints, but optimize source performance.

141. What are slicers, and how do they differ from visual-level filters? Discuss their impact on data in a Power BI dashboard

Detailed Interview Response:

Slicers: Are visual controls that let users filter data interactively across multiple visuals.

Visual-level filters: Apply only to a single visual.

Slicers provide a dashboard-wide filtering experience, improving interactivity and user-driven analysis.

142. How do you implement Row-Level Security (RLS) in Power BI? Explain how you would restrict data access to specific users or groups

Detailed Interview Response:

Define roles and DAX filters in Power BI Desktop (e.g., [Country] = USERPRINCIPALNAME()), Publish to Power BI Service, assign users/groups to roles. RLS ensures users see only data relevant to them, enforcing data privacy.

143. What is a paginated report, and when would you use it?

Detailed Interview Response:

Paginated reports are pixel-perfect, printable reports ideal for multi-page outputs like invoices or billing statements. They allow precise control over layout and are used when detailed, printable documents are required.

Guesstimate Questions

Analytical and estimation questions to test problem-solving and structured thinking.

144. Estimate the total number of hotel bookings made globally in a day on Booking.com. Explain the factors and assumptions you would consider to arrive at your estimate

Detailed Interview Response:

Break down by regions, estimate active users, average bookings per user, and seasonality. Use public data (market share, travel trends), adjust for weekends/holidays, and validate with industry reports. Show step-by-step logic and state assumptions clearly.

145. How many unique users do you think search for flights on Booking.com in a month? Provide a structured approach to your calculation

Detailed Interview Response:

Estimate total site traffic, % interested in flights, average searches per user, and repeat users. Use funnel analysis: total visitors → % flight searchers → unique users. State assumptions and adjust for mobile/web split.

Case Study Questions

Real-world scenario-based questions to assess analytical and problem-solving skills.

146. You notice a sudden drop in conversion rates (from search to booking) for hotels in a particular city. How would you investigate the root cause and propose solutions?

Detailed Interview Response:

Analyze funnel data (searches, clicks, bookings), segment by device, channel, and user type. Check for UI changes, pricing issues, inventory problems, or external events. Propose A/B tests, user surveys, and cross-team collaboration to address findings.

147. Booking.com is launching a new feature that allows users to book multi-city trips. How would you measure its success post-launch, and what metrics would you track to ensure its adoption and profitability?

Detailed Interview Response:

Track adoption rate (feature usage), conversion rate, average booking value, customer retention, and feedback. Compare with single-city bookings, monitor technical performance, and analyze user cohorts for repeat usage and upsell/cross-sell impact.

General Data Engineering & Project Experience

Questions about your personal project experience and contributions.

148. Can you elaborate on your current project and the specific contributions you make to it?

Sample Answer:

Briefly describe the project domain, tech stack, business goal, and your roles (e.g., ETL design, data modeling, performance tuning, automation, etc.).

Advanced SQL Scenarios & Functions

Advanced SQL queries covering window functions, aggregations, and complex scenarios.

149. Write a query to find the second-highest salary in a department. You might use ROW_NUMBER() or DENSE_RANK() to achieve this.

Detailed Interview Response / Example:

SELECT department_id, salary
FROM (
    SELECT
        department_id,
        salary,
        ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rn
    FROM employees
) ranked
WHERE rn = 2;
-- DENSE_RANK() can be used if you want to handle ties.
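A minimal runnable sketch of the ROW_NUMBER() vs DENSE_RANK() difference, using SQLite via Python (window functions need SQLite 3.25+). The table and salaries are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (department_id INT, salary INT);
    INSERT INTO employees VALUES
        (1, 90000), (1, 90000), (1, 80000),
        (2, 70000), (2, 60000);
""")

# ROW_NUMBER(): tied salaries get distinct ranks, so one tied 90000 row is rn = 2.
row_number_result = conn.execute("""
    SELECT department_id, salary FROM (
        SELECT department_id, salary,
               ROW_NUMBER() OVER (PARTITION BY department_id
                                  ORDER BY salary DESC) AS rn
        FROM employees
    ) ranked WHERE rn = 2
""").fetchall()

# DENSE_RANK(): tied salaries share rank 1, so rank 2 is the next distinct salary.
dense_rank_result = conn.execute("""
    SELECT department_id, salary FROM (
        SELECT department_id, salary,
               DENSE_RANK() OVER (PARTITION BY department_id
                                  ORDER BY salary DESC) AS rnk
        FROM employees
    ) ranked WHERE rnk = 2
""").fetchall()

print(row_number_result)  # dept 1 yields 90000 (the tied duplicate)
print(dense_rank_result)  # dept 1 yields 80000 (next distinct salary)
```

The choice between the two is exactly the tie-handling question interviewers probe: ROW_NUMBER() treats a duplicate top salary as "second", DENSE_RANK() skips past it.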

150. Create a query to calculate the total number of transactions per user for each day. This typically involves GROUP BY and COUNT() for aggregation.

Detailed Interview Response / Example:

SELECT user_id, transaction_date, COUNT(*) AS total_transactions
FROM transactions
GROUP BY user_id, transaction_date;

151. Write a query to select projects with the highest budget-per-employee ratio from two related tables (projects and employees). This tests your ability to work with complex joins and aggregations.

Detailed Interview Response / Example:

SELECT
    p.project_id,
    p.project_name,
    (p.budget / COUNT(e.employee_id)) AS budget_per_employee
FROM projects p
JOIN employees e ON p.project_id = e.project_id
GROUP BY p.project_id, p.project_name, p.budget
ORDER BY budget_per_employee DESC
LIMIT 1;

152. Find the second-highest salary in a table without using LIMIT or TOP.

Explanation / Correction:

Finds the maximum salary less than the overall max, giving the second-highest.

Sample SQL Answer:

SELECT MAX(salary)
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);

153. Write a SQL query to find all employees who earn more than their managers.

Explanation / Correction:

Self-join employees to their managers and compare salaries.

Sample SQL Answer:

SELECT e1.*
FROM employees e1
JOIN employees e2 ON e1.manager_id = e2.id
WHERE e1.salary > e2.salary;
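A small runnable check of the self-join pattern, using SQLite via Python; the names and salaries are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INT, name TEXT, salary INT, manager_id INT);
    INSERT INTO employees VALUES
        (1, 'Asha',  120000, NULL),  -- top-level manager, no manager_id
        (2, 'Bilal', 130000, 1),     -- earns more than manager Asha
        (3, 'Chen',   90000, 1);     -- earns less than manager Asha
""")

# Self-join each employee (e1) to their manager (e2) and compare salaries.
result = conn.execute("""
    SELECT e1.name
    FROM employees e1
    JOIN employees e2 ON e1.manager_id = e2.id
    WHERE e1.salary > e2.salary
""").fetchall()
print(result)  # [('Bilal',)]
```

Note that the inner join silently drops employees with a NULL manager_id, which is usually the desired behavior here.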

154. Find the duplicate rows in a table without using GROUP BY.

Explanation / Correction:

Uses EXISTS to find duplicates based on column values, avoiding GROUP BY. Note that rowid is specific to databases such as Oracle and SQLite; on other systems, compare on a primary key instead.

Sample SQL Answer:

SELECT t1.*
FROM table_name t1
WHERE EXISTS (
    SELECT 1
    FROM table_name t2
    WHERE t1.column1 = t2.column1
      AND t1.column2 = t2.column2
      AND t1.rowid <> t2.rowid
);
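Since SQLite natively supports rowid, the EXISTS pattern can be verified directly; the table and rows below are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, item TEXT);
    INSERT INTO orders VALUES ('a', 'x'), ('a', 'x'), ('b', 'y');
""")

# Each row is a duplicate if another row (different rowid) has the same values.
dupes = conn.execute("""
    SELECT t1.customer, t1.item
    FROM orders t1
    WHERE EXISTS (
        SELECT 1 FROM orders t2
        WHERE t1.customer = t2.customer
          AND t1.item = t2.item
          AND t1.rowid <> t2.rowid
    )
""").fetchall()
print(dupes)  # both copies of ('a', 'x') are flagged
```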

155. Write a SQL query to find the top 10% of earners in a table.

Explanation / Correction:

NTILE(10) splits data into 10 buckets; decile 1 is top 10%.

Sample SQL Answer:

SELECT *
FROM (
    SELECT *, NTILE(10) OVER (ORDER BY salary DESC) AS decile
    FROM employees
) ranked
WHERE decile = 1;
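A runnable sketch of NTILE(10) on ten invented rows (SQLite 3.25+); with exactly ten rows each decile holds one employee, so decile 1 is the single top earner:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INT)")
# Invented data: e1 earns 10000, e2 earns 20000, ..., e10 earns 100000.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(f"e{i}", i * 10000) for i in range(1, 11)],
)

top_decile = conn.execute("""
    SELECT name FROM (
        SELECT name, NTILE(10) OVER (ORDER BY salary DESC) AS decile
        FROM employees
    ) ranked
    WHERE decile = 1
""").fetchall()
print(top_decile)  # [('e10',)] -- the highest earner
```

With more rows, bucket sizes grow proportionally, so decile 1 always holds roughly the top 10% of earners.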

156. Find the cumulative sum of a column in a table.

Explanation / Correction:

SUM() OVER (ORDER BY ...) computes running totals.

Sample SQL Answer:

SELECT *, SUM(column) OVER (ORDER BY id) AS cumulative_sum
FROM table_name;
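A runnable sketch of the running-total pattern the explanation describes, using SQLite via Python with an invented sales table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (id INT, amount INT);
    INSERT INTO sales VALUES (1, 100), (2, 200), (3, 50);
""")

# SUM() OVER (ORDER BY ...) accumulates rows in order: 100, 300, 350.
running = conn.execute("""
    SELECT id, amount,
           SUM(amount) OVER (ORDER BY id) AS cumulative_sum
    FROM sales
""").fetchall()
print(running)
```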

157. Write a SQL query to find all employees who have never taken a leave.

Explanation / Correction:

Finds employees whose IDs are not in the leave records. Prefer NOT EXISTS if leaves.employee_id can contain NULLs, since NOT IN returns no rows when the subquery yields a NULL.

Sample SQL Answer:

SELECT *
FROM employees
WHERE id NOT IN (SELECT employee_id FROM leaves);

158. Find the difference between the current row and the next row in a table.

Explanation / Correction:

LEAD() gets the next row's value for difference calculation.

Sample SQL Answer:

SELECT *, column - LEAD(column) OVER (ORDER BY id) AS diff_with_next
FROM table_name;

159. Write a SQL query to find all departments with more than one employee.

Explanation / Correction:

GROUP BY and HAVING find departments with multiple employees.

Sample SQL Answer:

SELECT department
FROM employees
GROUP BY department
HAVING COUNT(*) > 1;

160. Find the maximum value of a column for each group without using GROUP BY.

Explanation / Correction:

Window function MAX() OVER (PARTITION BY ...) finds group max per row.

Sample SQL Answer:

SELECT *
FROM (
    SELECT *, MAX(column) OVER (PARTITION BY group_column) AS max_in_group
    FROM table_name
) t
WHERE column = max_in_group;
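A quick runnable check of the MAX() OVER (PARTITION BY ...) pattern on an invented two-group table (SQLite 3.25+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (grp TEXT, val INT);
    INSERT INTO scores VALUES ('a', 1), ('a', 5), ('b', 3), ('b', 2);
""")

# The window MAX is attached to every row; keeping rows where val equals it
# selects each group's maximum without a GROUP BY.
group_max = conn.execute("""
    SELECT grp, val FROM (
        SELECT grp, val,
               MAX(val) OVER (PARTITION BY grp) AS max_in_group
        FROM scores
    ) t
    WHERE val = max_in_group
""").fetchall()
print(group_max)  # one max row per group
```

Unlike GROUP BY, this keeps all the other columns of the winning row, and returns every tied row if a group's maximum is shared.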

161. Write a SQL query to find all employees who have taken more than 3 leaves in a month.

Explanation / Correction:

Group by employee and month, then filter for those with >3 leaves.

Sample SQL Answer:

SELECT
    employee_id,
    EXTRACT(YEAR FROM leave_date) AS year,
    EXTRACT(MONTH FROM leave_date) AS month
FROM leaves
GROUP BY employee_id, year, month
HAVING COUNT(*) > 3;

SQL Scenario-Based Queries

Real-world SQL scenarios testing analytical and problem-solving abilities.

162. Scenario: Top N products by sales, Top N within each category, Top N employees by salaries

Explanation / Use Case:

Use ORDER BY with LIMIT for global top N; use ROW_NUMBER() OVER (PARTITION BY ...) for top N within groups.

Sample SQL Query:

-- Top N products overall
SELECT product_id, SUM(sales) AS total_sales
FROM sales
GROUP BY product_id
ORDER BY total_sales DESC
LIMIT N;

-- Top N products within each category
SELECT *
FROM (
    SELECT
        product_id,
        category_id,
        SUM(sales) AS total_sales,
        ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY SUM(sales) DESC) AS rn
    FROM sales
    GROUP BY product_id, category_id
) ranked
WHERE rn <= N;
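A runnable variant of the per-category top-N pattern, using SQLite via Python. It computes totals in a CTE first (a slight restructuring for clarity); the sales rows are invented and N is fixed at 2:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (product_id TEXT, category_id INT, sales INT);
    INSERT INTO sales VALUES
        ('p1', 1, 60), ('p1', 1, 40),   -- p1 totals 100 in category 1
        ('p2', 1, 90), ('p3', 1, 10),
        ('p4', 2, 50), ('p5', 2, 70);
""")

top2_per_category = conn.execute("""
    WITH totals AS (
        SELECT product_id, category_id, SUM(sales) AS total_sales
        FROM sales
        GROUP BY product_id, category_id
    )
    SELECT product_id, category_id, total_sales
    FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY category_id
                                     ORDER BY total_sales DESC) AS rn
        FROM totals
    ) ranked
    WHERE rn <= 2
""").fetchall()
print(top2_per_category)  # p1, p2 for category 1; p5, p4 for category 2
```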

163. Scenario: Year-over-year (YOY) growth, YOY by category, products with higher sales than previous month

Explanation / Use Case:

Use LAG() for YOY or month-over-month comparisons. Aggregate sales by time period and group.

Sample SQL Query:

-- YOY growth for each category
SELECT
    category_id,
    year,
    sales,
    sales - LAG(sales) OVER (PARTITION BY category_id ORDER BY year) AS yoy_growth
FROM (
    SELECT category_id, EXTRACT(YEAR FROM sale_date) AS year, SUM(sales) AS sales
    FROM sales
    GROUP BY category_id, year
) yearly;

-- Products with higher sales than previous month
SELECT product_id, month, sales
FROM (
    SELECT
        product_id,
        EXTRACT(MONTH FROM sale_date) AS month,
        SUM(sales) AS sales,
        LAG(SUM(sales)) OVER (PARTITION BY product_id ORDER BY month) AS prev_month_sales
    FROM sales
    GROUP BY product_id, month
) t
WHERE sales > prev_month_sales;
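The LAG() comparison can be verified on a tiny invented monthly series (SQLite 3.25+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monthly (product TEXT, month INT, sales INT);
    INSERT INTO monthly VALUES
        ('p1', 1, 100), ('p1', 2, 150), ('p1', 3, 120);
""")

# LAG(sales) pulls the previous month's value; month 2 (150 > 100) qualifies,
# month 3 (120 > 150) does not, and month 1 has no previous row (NULL).
grew = conn.execute("""
    SELECT product, month, sales FROM (
        SELECT product, month, sales,
               LAG(sales) OVER (PARTITION BY product ORDER BY month) AS prev_month_sales
        FROM monthly
    ) t
    WHERE sales > prev_month_sales
""").fetchall()
print(grew)  # [('p1', 2, 150)]
```

Note the first row of each partition is dropped automatically, since comparing against a NULL LAG() value yields unknown.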

164. Scenario: Running sales over months, rolling N months sales, within each category

Explanation / Use Case:

Use SUM() OVER with appropriate window frames for running or rolling totals.

Sample SQL Query:

-- Running total sales per product
SELECT
    product_id,
    month,
    SUM(sales) OVER (PARTITION BY product_id ORDER BY month) AS running_total
FROM sales;

-- Rolling 3-month sales per category
SELECT
    category_id,
    month,
    SUM(sales) OVER (
        PARTITION BY category_id
        ORDER BY month
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS rolling_3_month_sales
FROM sales;
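A runnable sketch of the ROWS BETWEEN frame on four invented months (SQLite 3.25+), showing how the rolling window caps at three rows once enough history exists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monthly (month INT, sales INT);
    INSERT INTO monthly VALUES (1, 10), (2, 20), (3, 30), (4, 40);
""")

# ROWS BETWEEN 2 PRECEDING AND CURRENT ROW sums at most 3 rows:
# month 1 -> 10, month 2 -> 30, month 3 -> 60, month 4 -> 90 (drops month 1).
rolling = conn.execute("""
    SELECT month,
           SUM(sales) OVER (
               ORDER BY month
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS rolling_3_month_sales
    FROM monthly
""").fetchall()
print(rolling)
```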

165. Scenario: Pivot rows to columns (e.g., year-wise sales for each category in separate columns)

Explanation / Use Case:

Use CASE WHEN inside aggregate functions to pivot data from rows to columns.

Sample SQL Query:

SELECT
    category_id,
    SUM(CASE WHEN year = 2021 THEN sales ELSE 0 END) AS sales_2021,
    SUM(CASE WHEN year = 2022 THEN sales ELSE 0 END) AS sales_2022
FROM (
    SELECT category_id, EXTRACT(YEAR FROM sale_date) AS year, sales
    FROM sales
) t
GROUP BY category_id;
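The CASE-WHEN pivot can be checked end to end on invented rows; note how a category with no 2022 sales gets 0, not NULL, thanks to the ELSE 0 branch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cat_sales (category TEXT, year INT, sales INT);
    INSERT INTO cat_sales VALUES
        ('a', 2021, 5), ('a', 2022, 7), ('b', 2021, 3);
""")

# Each CASE WHEN keeps only one year's sales, so SUM() pivots years into columns.
pivoted = conn.execute("""
    SELECT category,
           SUM(CASE WHEN year = 2021 THEN sales ELSE 0 END) AS sales_2021,
           SUM(CASE WHEN year = 2022 THEN sales ELSE 0 END) AS sales_2022
    FROM cat_sales
    GROUP BY category
    ORDER BY category
""").fetchall()
print(pivoted)  # [('a', 5, 7), ('b', 3, 0)]
```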

166. Scenario: Number of records after different kinds of joins

Explanation / Use Case:

Different joins affect row counts based on matching and non-matching records. Useful for data quality checks and understanding data relationships.

Sample SQL Query:

-- INNER JOIN
SELECT COUNT(*) FROM a INNER JOIN b ON a.id = b.id;

-- LEFT JOIN
SELECT COUNT(*) FROM a LEFT JOIN b ON a.id = b.id;

-- RIGHT JOIN
SELECT COUNT(*) FROM a RIGHT JOIN b ON a.id = b.id;

-- FULL OUTER JOIN
SELECT COUNT(*) FROM a FULL OUTER JOIN b ON a.id = b.id;

167. Scenario: Consider a table named "Orders" with the following columns: OrderID, OrderDate, CustomerID. Write a SQL query to find the customers who have placed orders on consecutive days.

Explanation / Use Case:

Joins the Orders table to itself to find customers with orders on consecutive days. Note that DATEDIFF(DAY, ...) is SQL Server syntax; MySQL uses DATEDIFF(later_date, earlier_date), and other databases subtract dates directly.

Sample SQL Query:

SELECT DISTINCT o1.CustomerID
FROM Orders o1
JOIN Orders o2
  ON o1.CustomerID = o2.CustomerID
 AND DATEDIFF(DAY, o1.OrderDate, o2.OrderDate) = 1;
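Because DATEDIFF varies by dialect, here is the same self-join verified in SQLite, where julianday() differences play the same role; the order rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Orders (OrderID INT, CustomerID INT, OrderDate TEXT);
    INSERT INTO Orders VALUES
        (1, 1, '2024-01-10'), (2, 1, '2024-01-11'),  -- consecutive days
        (3, 2, '2024-01-10'), (4, 2, '2024-01-15');  -- not consecutive
""")

# julianday() returns a day number, so a difference of 1 means consecutive days.
consecutive = conn.execute("""
    SELECT DISTINCT o1.CustomerID
    FROM Orders o1
    JOIN Orders o2
      ON o1.CustomerID = o2.CustomerID
     AND julianday(o2.OrderDate) - julianday(o1.OrderDate) = 1
""").fetchall()
print(consecutive)  # [(1,)] -- only customer 1 ordered on consecutive days
```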

SQL Best Practices

Essential SQL optimization techniques and coding standards for efficient query writing.

Use EXISTS in place of IN wherever possible

EXISTS is often more efficient, especially with correlated subqueries, as it stops searching once a match is found.

Use table aliases with columns when joining multiple tables

Improves readability and avoids ambiguity, especially in complex queries.

Use GROUP BY instead of DISTINCT

GROUP BY is more flexible and can be optimized better by query optimizers.

Add useful comments for complex logic, avoid too many comments

Comments help others (and your future self) understand non-trivial logic, but excessive comments can clutter code.

Use joins instead of subqueries where they give better performance

Joins are typically more efficient and easier for query optimizers to handle than nested subqueries.

Use WHERE instead of HAVING to filter non-aggregate fields

WHERE filters rows before aggregation, improving performance. HAVING should be used only for aggregate conditions.

Avoid wildcards at the beginning of predicates (e.g., '%abc')

Leading wildcards prevent index usage, causing full table scans.

Consider cardinality within GROUP BY (unique columns first)

Grouping by columns with higher cardinality first can improve query performance by reducing intermediate result size.

Write SQL keywords in capital letters for readability

Capitalizing SQL keywords (SELECT, FROM, WHERE) improves readability and makes statements easier to scan visually.

Never use SELECT *; always specify columns

Improves performance, reduces unnecessary data transfer, and makes breaking changes to schemas easier to detect.

Create CTEs instead of multiple subqueries

CTEs (WITH clauses) make queries easier to read, maintain, and debug.

Join tables using JOIN keywords, not WHERE clause

Increases readability and reduces risk of accidental cross joins or logic errors.

Never use ORDER BY in subqueries

ORDER BY in a subquery is typically ignored by the database, so it only increases runtime without benefit.

Use UNION ALL instead of UNION when no duplicates

UNION ALL is faster because it skips duplicate elimination. Use UNION only when necessary.

Start WHERE clause with 1 = 1 for easy debugging

Allows easy commenting/uncommenting of conditions during query development.

Handle NULLs before using equality or comparison operators

Prevents unexpected results or errors, as NULL comparisons behave differently.
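A quick demonstration of why this matters, run through SQLite: NULL compared with = yields unknown (not true), while IS NULL and COALESCE handle it explicitly:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# NULL = NULL evaluates to NULL (unknown), not 1 (true).
eq_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

# IS NULL is the correct test and returns true.
is_null = conn.execute("SELECT NULL IS NULL").fetchone()[0]

# COALESCE substitutes a default before any comparison.
coalesced = conn.execute("SELECT COALESCE(NULL, 0)").fetchone()[0]

print(eq_null, is_null, coalesced)  # None 1 0
```

This is also why a WHERE clause like `col <> 'x'` silently excludes rows where col is NULL.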

Filter queries before joining or CTE to reduce data volume early

Improves performance and reduces resource usage.

Ensure JOIN conditions use keys or indexed attributes

Indexed joins are faster and more efficient, reducing query execution time.