Data Engineering Interview Questions & Solutions

SQL Interview Questions

Comprehensive SQL questions covering window functions, joins, aggregations, and query optimization.

1. Find the top 3 cities with the highest sales per month

Sample Data/Table:

sale_id city sale_date amount
1 Mumbai 2024-01-10 5000
2 Delhi 2024-01-15 7000
3 Bangalore 2024-01-20 10000
4 Chennai 2024-02-05 3000
5 Mumbai 2024-02-08 9000

SQL Solution / Explanation:

SELECT sale_month, city, total_sales
FROM (
    SELECT
        DATE_FORMAT(sale_date, '%Y-%m') AS sale_month,
        city,
        SUM(amount) AS total_sales,
        ROW_NUMBER() OVER (PARTITION BY DATE_FORMAT(sale_date, '%Y-%m') ORDER BY SUM(amount) DESC) AS rn
    FROM sales_table
    GROUP BY sale_month, city
) ranked
WHERE rn <= 3;

Explanation: This query uses window functions to rank cities by total sales within each month. The ROW_NUMBER() function assigns ranks, and we filter for the top 3 cities per month using WHERE rn <= 3.
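As a sanity check, the same logic can be run end-to-end with Python's built-in sqlite3 module. This is a sketch: SQLite's strftime('%Y-%m', ...) stands in for MySQL's DATE_FORMAT, and the monthly totals are pre-aggregated in a CTE so the window function ranks plain columns.

```python
import sqlite3

# Hypothetical in-memory copy of the sample sales_table from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_table (sale_id INT, city TEXT, sale_date TEXT, amount INT);
INSERT INTO sales_table VALUES
  (1, 'Mumbai',    '2024-01-10',  5000),
  (2, 'Delhi',     '2024-01-15',  7000),
  (3, 'Bangalore', '2024-01-20', 10000),
  (4, 'Chennai',   '2024-02-05',  3000),
  (5, 'Mumbai',    '2024-02-08',  9000);
""")
rows = conn.execute("""
WITH monthly AS (
    SELECT strftime('%Y-%m', sale_date) AS sale_month, city,
           SUM(amount) AS total_sales
    FROM sales_table
    GROUP BY sale_month, city
)
SELECT sale_month, city, total_sales
FROM (
    SELECT sale_month, city, total_sales,
           ROW_NUMBER() OVER (PARTITION BY sale_month
                              ORDER BY total_sales DESC) AS rn
    FROM monthly
) ranked
WHERE rn <= 3
ORDER BY sale_month, total_sales DESC;
""").fetchall()
print(rows)
# January ranks Bangalore > Delhi > Mumbai; February ranks Mumbai > Chennai
```

On the sample data, each month has at most three cities, so every row survives the rn <= 3 filter; the ranking matters once a month has four or more cities.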

2. Write an SQL query to calculate the running total of sales for each city

Sample Data/Table:

sale_id city sale_date amount
1 Mumbai 2024-01-10 5000
2 Delhi 2024-01-15 7000
3 Mumbai 2024-01-20 3000
4 Delhi 2024-02-05 6000
5 Mumbai 2024-02-08 8000

SQL Solution / Explanation:

SELECT
    sale_id,
    city,
    sale_date,
    amount,
    SUM(amount) OVER (PARTITION BY city ORDER BY sale_date) AS running_total
FROM sales_data
ORDER BY city, sale_date;

Explanation: The SUM(amount) OVER (PARTITION BY city ORDER BY sale_date) creates a running total for each city. The window function accumulates the sum ordered by sale date within each city partition.
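The running total can be verified against the sample rows with sqlite3 (a sketch; the query is unchanged apart from dropping the sale_date column from the output):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_data (sale_id INT, city TEXT, sale_date TEXT, amount INT);
INSERT INTO sales_data VALUES
  (1, 'Mumbai', '2024-01-10', 5000),
  (2, 'Delhi',  '2024-01-15', 7000),
  (3, 'Mumbai', '2024-01-20', 3000),
  (4, 'Delhi',  '2024-02-05', 6000),
  (5, 'Mumbai', '2024-02-08', 8000);
""")
rows = conn.execute("""
SELECT sale_id, city, amount,
       SUM(amount) OVER (PARTITION BY city ORDER BY sale_date) AS running_total
FROM sales_data
ORDER BY city, sale_date;
""").fetchall()
print(rows)
# Delhi accumulates 7000 -> 13000; Mumbai accumulates 5000 -> 8000 -> 16000
```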

3. Find the second highest salary of employees

Sample Data/Table:

emp_id emp_name salary department
1 Ravi 70000 HR
2 Priya 90000 IT
3 Kunal 85000 Finance
4 Aisha 60000 IT
5 Rahul 95000 HR

SQL Solution / Explanation:

SELECT MAX(salary) AS second_highest_salary
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);

Explanation: This query finds the maximum salary that is less than the overall maximum salary. The subquery finds the highest salary, and the outer query finds the maximum of all remaining salaries.
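A quick sqlite3 sketch confirms the MAX-below-MAX pattern on the sample employees table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (emp_id INT, emp_name TEXT, salary INT, department TEXT);
INSERT INTO employees VALUES
  (1, 'Ravi',  70000, 'HR'),      (2, 'Priya', 90000, 'IT'),
  (3, 'Kunal', 85000, 'Finance'), (4, 'Aisha', 60000, 'IT'),
  (5, 'Rahul', 95000, 'HR');
""")
second = conn.execute("""
SELECT MAX(salary) FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
""").fetchone()[0]
print(second)  # 90000, since the overall maximum is 95000
```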

4. Find employees who have the same salary as someone in the same department

Sample Data/Table:

emp_id emp_name salary department
1 Neha 50000 HR
2 Ravi 70000 IT
3 Aman 50000 HR
4 Pooja 90000 IT
5 Karan 70000 IT

SQL Solution / Explanation:

SELECT *
FROM employee_salary e1
WHERE EXISTS (
    SELECT 1
    FROM employee_salary e2
    WHERE e1.department = e2.department
      AND e1.salary = e2.salary
      AND e1.emp_id <> e2.emp_id
);

Explanation: This query uses a correlated subquery with EXISTS to find employees who have at least one other employee in the same department with the same salary. The condition e1.emp_id <> e2.emp_id ensures we don't match an employee with themselves.

5. Write an SQL query to find duplicate records in a table

Sample Data/Table:

user_id user_name email
1 Sameer sameer@gmail.com
2 Anjali anjali@gmail.com
3 Sameer sameer@gmail.com
4 Rohan rohan@gmail.com
5 Rohan rohan@gmail.com

SQL Solution / Explanation:

SELECT user_name, email, COUNT(*) AS duplicate_count
FROM users
GROUP BY user_name, email
HAVING COUNT(*) > 1;

Explanation: This query groups records by user_name and email, then uses HAVING COUNT(*) > 1 to filter only the groups that have more than one occurrence, effectively finding duplicates.

6. Write an SQL query to delete duplicate rows while keeping only one unique record

Sample Data/Table:

user_id user_name email
1 Sameer sameer@gmail.com
2 Anjali anjali@gmail.com
3 Sameer sameer@gmail.com
4 Rohan rohan@gmail.com
5 Rohan rohan@gmail.com

SQL Solution / Explanation:

DELETE FROM users
WHERE user_id NOT IN (
    SELECT keep_id FROM (
        SELECT MIN(user_id) AS keep_id
        FROM users
        GROUP BY user_name, email
    ) AS keepers
);

Explanation: This query keeps the record with the minimum user_id for each unique combination of user_name and email, and deletes all other duplicate records. The inner subquery is wrapped in a derived table (keepers) because MySQL does not allow a DELETE to reference the table it is deleting from in a direct subquery.
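The dedupe can be exercised with sqlite3 (a sketch: SQLite, unlike MySQL, accepts the direct self-referencing subquery, so no derived-table wrapper is needed here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INT, user_name TEXT, email TEXT);
INSERT INTO users VALUES
  (1, 'Sameer', 'sameer@gmail.com'), (2, 'Anjali', 'anjali@gmail.com'),
  (3, 'Sameer', 'sameer@gmail.com'), (4, 'Rohan',  'rohan@gmail.com'),
  (5, 'Rohan',  'rohan@gmail.com');
-- Keep the lowest user_id per (user_name, email), delete the rest.
DELETE FROM users
WHERE user_id NOT IN (
    SELECT MIN(user_id) FROM users GROUP BY user_name, email
);
""")
remaining = conn.execute("SELECT user_id FROM users ORDER BY user_id").fetchall()
print(remaining)  # [(1,), (2,), (4,)]
```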

7. Write an SQL query to pivot a table by months

Sample Data/Table:

sale_id city sale_date amount
1 Mumbai 2024-01-10 5000
2 Delhi 2024-02-15 7000
3 Mumbai 2024-01-20 3000
4 Delhi 2024-03-05 6000
5 Mumbai 2024-02-08 8000

SQL Solution / Explanation:

SELECT
    city,
    SUM(CASE WHEN DATE_FORMAT(sale_date, '%Y-%m') = '2024-01' THEN amount ELSE 0 END) AS Jan_2024,
    SUM(CASE WHEN DATE_FORMAT(sale_date, '%Y-%m') = '2024-02' THEN amount ELSE 0 END) AS Feb_2024,
    SUM(CASE WHEN DATE_FORMAT(sale_date, '%Y-%m') = '2024-03' THEN amount ELSE 0 END) AS Mar_2024
FROM sales_data
GROUP BY city;

Explanation: This query uses conditional aggregation with CASE statements to pivot the data. Each month becomes a separate column, and the sum of amounts for that month is calculated for each city.
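The pivot can be checked with sqlite3 (a sketch, with strftime in place of MySQL's DATE_FORMAT):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_data (sale_id INT, city TEXT, sale_date TEXT, amount INT);
INSERT INTO sales_data VALUES
  (1, 'Mumbai', '2024-01-10', 5000),
  (2, 'Delhi',  '2024-02-15', 7000),
  (3, 'Mumbai', '2024-01-20', 3000),
  (4, 'Delhi',  '2024-03-05', 6000),
  (5, 'Mumbai', '2024-02-08', 8000);
""")
rows = conn.execute("""
SELECT city,
       SUM(CASE WHEN strftime('%Y-%m', sale_date) = '2024-01' THEN amount ELSE 0 END) AS Jan_2024,
       SUM(CASE WHEN strftime('%Y-%m', sale_date) = '2024-02' THEN amount ELSE 0 END) AS Feb_2024,
       SUM(CASE WHEN strftime('%Y-%m', sale_date) = '2024-03' THEN amount ELSE 0 END) AS Mar_2024
FROM sales_data
GROUP BY city
ORDER BY city;
""").fetchall()
print(rows)  # [('Delhi', 0, 7000, 6000), ('Mumbai', 8000, 8000, 0)]
```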

8. Find customers who placed at least 3 orders in the last 6 months

Sample Data/Table:

order_id customer_id order_date amount
1 101 2024-01-10 1000
2 102 2024-02-15 2000
3 101 2024-03-20 1500
4 103 2024-04-05 2500
5 101 2024-05-08 3000

SQL Solution / Explanation:

SELECT customer_id
FROM orders
WHERE order_date >= DATE_SUB(CURDATE(), INTERVAL 6 MONTH)
GROUP BY customer_id
HAVING COUNT(*) >= 3;

Explanation: This query filters orders from the last 6 months using DATE_SUB(CURDATE(), INTERVAL 6 MONTH), groups by customer, and uses HAVING COUNT(*) >= 3 to find customers with at least 3 orders in that period.

9. Normalization vs. Denormalization – What are they, and when should each be used in a data pipeline?

Explanation:

Normalization

Breaking tables into smaller, related tables to reduce redundancy and improve data integrity. Best suited to OLTP systems with frequent updates, where data consistency matters most.

Denormalization

Combining tables to reduce joins and improve read performance. Best suited to OLAP systems, reporting, and analytics.

10. Indexing in SQL – Explain clustered vs. non-clustered indexes. How do they impact query performance?

Explanation:

Clustered Index

Determines the physical order of rows in the table; only one per table; fast for range queries.

Non-Clustered Index

A separate structure that points back to the data; a table can have many; fast for point lookups. Any index speeds up SELECTs but slows down writes, since every INSERT/UPDATE/DELETE must also maintain the index.

11. Write an SQL query to find the second highest salary from an employee table

SQL Solution / Explanation:

SELECT MAX(salary) AS second_highest_salary
FROM employee
WHERE salary < (SELECT MAX(salary) FROM employee);
-- Alternatively, use ROW_NUMBER() or DENSE_RANK() for ties.

Explanation: Finds the maximum salary that is less than the overall maximum. Alternative methods include using ROW_NUMBER() or DENSE_RANK() window functions for handling ties.

12. How do you handle NULL values in SQL joins?

Explanation:

Use IS NULL or COALESCE() to manage NULLs. In joins, NULLs can lead to unmatched rows. For example, LEFT JOIN returns NULLs for missing matches. Use COALESCE(column, 'default') to replace NULLs.

13. Extract pipeline name, current month name, number of failures, and identify the maximum failures in the current month

SQL Solution / Explanation:

SELECT pipeline_name, MONTHNAME(run_date) AS month, COUNT(*) AS failures,
       MAX(COUNT(*)) OVER () AS max_failures
FROM pipeline_log
WHERE status='FAILED' AND MONTH(run_date)=MONTH(CURDATE())
GROUP BY pipeline_name, MONTHNAME(run_date);

Explanation: Uses MONTHNAME() to get the month name, counts failures per pipeline, and uses a window function to find the maximum failures across all pipelines.

14. Develop a code to generate the following output based on the provided input

SQL Solution / Explanation:

SELECT id,
       AVG(CASE WHEN item_name='Apple' THEN value END) AS Apple,
       AVG(CASE WHEN item_name='Orange' THEN value END) AS Orange,
       AVG(CASE WHEN item_name='Banana' THEN value END) AS Banana
FROM input_table
GROUP BY id;

Explanation: Uses conditional aggregation with CASE statements to pivot data, creating separate columns for each item type.

15. Identify the 3rd highest sales amount in each region based on the saleid, product, region, and salesamount data

SQL Solution / Explanation:

SELECT region, salesamount
FROM (
    SELECT region, salesamount,
           ROW_NUMBER() OVER (PARTITION BY region ORDER BY salesamount DESC) AS rn
    FROM sales
) t
WHERE rn = 3;

Explanation: Uses ROW_NUMBER() window function partitioned by region and ordered by salesamount descending, then filters for the 3rd row in each partition.

16. Determine the left, right, and inner outputs from the given dataset

Explanation:

Left Join: all rows from A, plus matching rows from B (NULLs where B has no match).

Right Join: all rows from B, plus matching rows from A (NULLs where A has no match).

Inner Join: only the rows whose join keys match in both A and B.

17. Write an SQL query to calculate the customer churn rate over the last 6 months

SQL Solution / Explanation:

SELECT COUNT(DISTINCT customer_id) AS churned_customers,
       (SELECT COUNT(DISTINCT customer_id) FROM customers
        WHERE signup_date BETWEEN DATEADD(month, -6, GETDATE()) AND GETDATE()) AS total_customers,
       COUNT(DISTINCT customer_id)*1.0 /
       (SELECT COUNT(DISTINCT customer_id) FROM customers
        WHERE signup_date BETWEEN DATEADD(month, -6, GETDATE()) AND GETDATE()) AS churn_rate
FROM customers
WHERE last_active_date < DATEADD(month, -6, GETDATE());

Explanation: Calculates churn rate by dividing churned customers (inactive for 6+ months) by total customers who signed up in the last 6 months.

18. Calculate the cancellation rate for each room type over the last 6 months, considering only bookings of minimum stay of 2 nights

Detailed Interview Response:

SELECT room_type,
       COUNT(CASE WHEN status='cancelled' THEN 1 END)*1.0/COUNT(*) AS cancellation_rate
FROM bookings
WHERE stay_nights >= 2
  AND booking_date >= DATEADD(month, -6, CURRENT_DATE)
GROUP BY room_type;

Explanation: Filters bookings with 2+ night stays from the last 6 months, then calculates the proportion of cancelled bookings per room type.

19. Determine the average conversion rate (confirmed bookings vs. search events) for users grouped by their country and device type

Detailed Interview Response:

SELECT s.country, s.device_type,
       AVG(CASE WHEN b.booking_id IS NOT NULL THEN 1.0 ELSE 0 END) AS avg_conversion_rate
FROM searches s
LEFT JOIN bookings b ON s.user_id = b.user_id AND s.session_id = b.session_id
GROUP BY s.country, s.device_type;

Explanation: Joins search events with confirmed bookings, then calculates the conversion rate as the proportion of searches that resulted in bookings, grouped by country and device.

20. Identify properties that have consistently underperformed compared to the average booking rate of their region over the last 12 months

Detailed Interview Response:

WITH region_avg AS (
    SELECT region, AVG(booking_rate) AS avg_rate
    FROM properties
    WHERE booking_date >= DATEADD(month, -12, CURRENT_DATE)
    GROUP BY region
)
SELECT p.property_id
FROM properties p
JOIN region_avg r ON p.region = r.region
WHERE p.booking_rate < r.avg_rate
  AND p.booking_date >= DATEADD(month, -12, CURRENT_DATE);

Explanation: Uses a CTE to calculate regional average booking rates, then identifies properties performing below their region's average over the past year.

21. Detect instances of demand surge where the number of bookings in an hour exceeds the hourly average by more than 50%

Detailed Interview Response:

WITH hourly AS (
    SELECT DATE_TRUNC('hour', booking_time) AS hour, COUNT(*) AS bookings
    FROM bookings
    GROUP BY hour
),
avg_hourly AS (
    SELECT AVG(bookings) AS avg_bookings FROM hourly
)
SELECT h.hour, h.bookings
FROM hourly h CROSS JOIN avg_hourly a
WHERE h.bookings > 1.5 * a.avg_bookings;

Explanation: Calculates hourly booking counts, compares each hour to the overall hourly average, and identifies hours exceeding 150% of the average.

22. What challenges might arise when querying sharded databases, especially for calculating global metrics like average booking rates?

Detailed Interview Response:

Data distribution across shards makes cross-shard aggregation complex. Challenges include data consistency, network latency, partial failures, and the need for distributed queries or data pipelines to aggregate metrics globally.

23. Explain how you would handle booking timestamps originating from different time zones when querying for global daily booking patterns

Detailed Interview Response:

Normalize all timestamps to a standard (e.g., UTC) during ingestion or in queries. Store original and converted timestamps if local analysis is needed. Use timezone-aware functions to aggregate by global day.

24. How would you balance normalization for data integrity and denormalization for query performance?

Detailed Interview Response:

Normalize for OLTP systems to reduce redundancy and ensure integrity. Denormalize in OLAP/data warehouses for faster analytics. Use views or materialized views to combine normalized data for reporting.

25. If two systems simultaneously update the same booking record, what mechanisms would you use in SQL to prevent data conflicts and ensure consistency?

Detailed Interview Response:

Use transactions with isolation levels (e.g., SERIALIZABLE), optimistic concurrency control (row versioning/timestamps), or database locking mechanisms to prevent lost updates and ensure consistency.

26. Explain the scenarios where window functions outperform traditional group-by clauses in SQL

Detailed Interview Response:

Window functions allow calculations (e.g., running totals, ranks) across rows without collapsing them, enabling complex analytics (like moving averages or percentiles) that GROUP BY cannot provide without subqueries.

27. Find all employees who earn more than the average salary

Explanation:

Compares each employee's salary to the overall average.

Sample SQL Query:

SELECT * FROM employees WHERE salary > (SELECT AVG(salary) FROM employees);

28. Retrieve names of employees who work in the same department as 'John'

Explanation:

Finds employees in John's department, excluding John.

Sample SQL Query:

SELECT name FROM employees WHERE department_id = (SELECT department_id FROM employees WHERE name = 'John') AND name <> 'John';

29. Display the second highest salary from the Employee table

Explanation:

Gets the highest salary less than the maximum.

Sample SQL Query:

SELECT MAX(salary) FROM employees WHERE salary < (SELECT MAX(salary) FROM employees);

30. Find all customers who have made more than five orders

Explanation:

Groups orders by customer and filters for those with >5.

Sample SQL Query:

SELECT customer_id FROM orders GROUP BY customer_id HAVING COUNT(*) > 5;

31. Count the number of orders placed by each customer

Explanation:

Simple aggregation by customer.

Sample SQL Query:

SELECT customer_id, COUNT(*) AS order_count FROM orders GROUP BY customer_id;

32. Retrieve employees who joined in the last 6 months

Explanation:

Filters by join date within last 6 months.

Sample SQL Query:

SELECT * FROM employees WHERE join_date >= DATEADD(MONTH, -6, GETDATE());

33. Find the total sales amount per product

Explanation:

Aggregates sales by product.

Sample SQL Query:

SELECT product_id, SUM(sales_amount) AS total_sales FROM sales GROUP BY product_id;

34. List all products that have never been sold

Explanation:

Finds products with no matching sales.

Sample SQL Query:

SELECT p.product_id FROM products p LEFT JOIN sales s ON p.product_id = s.product_id WHERE s.product_id IS NULL;

35. Update salary of employees based on performance rating

Explanation:

Example: 10% raise for 'Excellent' performers. Adjust logic as needed.

Sample SQL Query:

UPDATE employees SET salary = salary * 1.10 WHERE performance_rating = 'Excellent';

36. Delete duplicate rows from a table

Explanation:

Keeps the first occurrence, deletes others. Replace col1, col2, ... with relevant columns.

Sample SQL Query:

WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY col1, col2, ... ORDER BY (SELECT NULL)) AS rn
    FROM table_name
)
DELETE FROM cte WHERE rn > 1;

37. Find the second highest salary without using MAX twice

Explanation:

Uses window function for ranking.

Sample SQL Query:

SELECT salary
FROM (
    SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk FROM employees
) t
WHERE rnk = 2;

38. Top 10 customers who have not placed an order in the last year

Explanation:

Finds customers with no orders in the last year.

Sample SQL Query:

SELECT TOP 10 c.customer_id, c.name
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id AND o.order_date >= DATEADD(YEAR, -1, GETDATE())
WHERE o.order_id IS NULL
ORDER BY c.customer_id;
-- (SQL Server syntax; in MySQL/PostgreSQL, drop TOP 10 and append LIMIT 10 instead.)

39. Compute year-over-year growth rate of revenue for each product category

Explanation:

Uses LAG() for previous year's revenue and calculates growth.

Sample SQL Query:

SELECT category_id, year, revenue,
       (revenue - LAG(revenue) OVER (PARTITION BY category_id ORDER BY year)) / NULLIF(LAG(revenue) OVER (PARTITION BY category_id ORDER BY year), 0) AS growth_rate
FROM (
    SELECT category_id, EXTRACT(YEAR FROM sale_date) AS year, SUM(revenue) AS revenue
    FROM sales
    GROUP BY category_id, year
) t;

40. Join three tables and show records that exist in exactly two of the tables

Explanation:

Counts presence in each table, filters for exactly two.

Sample SQL Query:

SELECT id
FROM (
    SELECT id,
           (CASE WHEN a.id IS NOT NULL THEN 1 ELSE 0 END +
            CASE WHEN b.id IS NOT NULL THEN 1 ELSE 0 END +
            CASE WHEN c.id IS NOT NULL THEN 1 ELSE 0 END) AS cnt
    FROM a FULL OUTER JOIN b ON a.id = b.id
    FULL OUTER JOIN c ON a.id = c.id OR b.id = c.id
) t
WHERE cnt = 2;

41. Find the median sales amount for each region

Explanation:

Uses the ordered-set aggregate PERCENTILE_CONT (PostgreSQL, Oracle). In SQL Server, PERCENTILE_CONT is a window function rather than an aggregate, so the per-region median must be written with OVER (PARTITION BY region) and then de-duplicated.

Sample SQL Query:

SELECT region,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sales_amount) AS median_sales
FROM sales
GROUP BY region;

42. Calculate the retention rate of customers over a given time period

Explanation:

Calculates the proportion of customers who made repeat purchases after their first order.

Sample SQL Query:

WITH cohort AS (
    SELECT customer_id, MIN(order_date) AS first_order
    FROM orders
    GROUP BY customer_id
), retained AS (
    SELECT c.customer_id
    FROM cohort c
    JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.order_date >= DATEADD(MONTH, 1, c.first_order)
)
SELECT COUNT(*) * 1.0 / (SELECT COUNT(*) FROM cohort) AS retention_rate FROM retained;
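The cohort/retained logic can be sketched with sqlite3 on made-up orders; date(..., '+1 month') replaces the T-SQL DATEADD, and DISTINCT guards against counting a customer twice:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer_id INT, order_date TEXT);
INSERT INTO orders VALUES
  (1, '2024-01-05'), (1, '2024-03-10'),  -- repeat buyer, a month+ later
  (2, '2024-02-01'),                     -- one-time buyer
  (3, '2024-01-20'), (3, '2024-01-25');  -- repeats, but within the first month
""")
rate = conn.execute("""
WITH cohort AS (
    SELECT customer_id, MIN(order_date) AS first_order
    FROM orders GROUP BY customer_id
), retained AS (
    SELECT DISTINCT c.customer_id
    FROM cohort c
    JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.order_date >= date(c.first_order, '+1 month')
)
SELECT COUNT(*) * 1.0 / (SELECT COUNT(*) FROM cohort) FROM retained;
""").fetchone()[0]
print(rate)  # 1 of 3 customers retained
```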

43. Find duplicate records and count number of duplicates for each unique record

Explanation:

Groups by all columns that define a duplicate.

Sample SQL Query:

SELECT col1, col2, COUNT(*) AS duplicate_count FROM table_name GROUP BY col1, col2 HAVING COUNT(*) > 1;

44. Employees with >5 years tenure but never promoted

Explanation:

Finds employees with long tenure and no promotion record.

Sample SQL Query:

SELECT e.*
FROM employees e
LEFT JOIN promotions p ON e.employee_id = p.employee_id
WHERE DATEDIFF(YEAR, e.join_date, GETDATE()) > 5 AND p.employee_id IS NULL;

45. Write a SQL query to get the daily count of active users (logged in at least once)

Explanation:

Groups login events by date and counts distinct users.

Sample SQL Query:

SELECT CAST(login_timestamp AS DATE) AS login_date, COUNT(DISTINCT user_id) AS active_users
FROM user_logins
GROUP BY CAST(login_timestamp AS DATE)
ORDER BY login_date;

46. Find the 2nd highest transaction per user without using LIMIT or TOP

Explanation:

Uses DENSE_RANK() to rank transactions within each user and filters for the 2nd rank.

Sample SQL Query:

SELECT user_id, transaction_amount
FROM (
    SELECT user_id, transaction_amount,
           DENSE_RANK() OVER (PARTITION BY user_id ORDER BY transaction_amount DESC) AS rnk
    FROM transactions
) ranked_transactions
WHERE rnk = 2;

47. Identify data gaps in time-series event logs (e.g., missing hourly records)

Explanation:

Generates a series of expected hours and LEFT JOINs with actual data to find missing hours. This requires a way to generate a series of dates/times.

Sample SQL Query (conceptual, depends on DB's date generation):

WITH all_hours AS (
    SELECT GENERATE_SERIES('2024-01-01 00:00:00'::timestamp, '2024-01-01 23:00:00'::timestamp, '1 hour') AS hour_start
), actual_hours AS (
    SELECT DATE_TRUNC('hour', event_timestamp) AS hour_start, COUNT(*) AS event_count
    FROM event_logs
    WHERE event_timestamp >= '2024-01-01 00:00:00' AND event_timestamp < '2024-01-02 00:00:00'
    GROUP BY 1
)
SELECT ah.hour_start AS missing_hour
FROM all_hours ah
LEFT JOIN actual_hours act ON ah.hour_start = act.hour_start
WHERE act.event_count IS NULL
ORDER BY missing_hour;
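The same gap detection can be sketched with a recursive CTE in place of PostgreSQL's generate_series, which also makes it runnable on SQLite or MySQL 8 (hypothetical event data; the window is shortened to four hours to keep the example small):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE event_logs (event_timestamp TEXT);
INSERT INTO event_logs VALUES
  ('2024-01-01 00:15:00'), ('2024-01-01 01:30:00'), ('2024-01-01 03:05:00');
""")
missing = conn.execute("""
WITH RECURSIVE all_hours(hour_start) AS (
    -- Generate every expected hour in the window
    SELECT '2024-01-01 00:00:00'
    UNION ALL
    SELECT datetime(hour_start, '+1 hour') FROM all_hours
    WHERE hour_start < '2024-01-01 03:00:00'
),
actual_hours AS (
    -- Truncate each event to the hour it falls in
    SELECT strftime('%Y-%m-%d %H:00:00', event_timestamp) AS hour_start
    FROM event_logs GROUP BY 1
)
SELECT ah.hour_start
FROM all_hours ah
LEFT JOIN actual_hours act ON ah.hour_start = act.hour_start
WHERE act.hour_start IS NULL;
""").fetchall()
print(missing)  # the 02:00 hour has no events
```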

48. Fetch the first purchase date per user and calculate days since then

Explanation:

Uses a subquery to find the minimum purchase date for each user, then calculates the difference from the current date.

Sample SQL Query:

SELECT user_id, MIN(purchase_date) AS first_purchase_date,
       DATEDIFF(CURRENT_DATE, MIN(purchase_date)) AS days_since_first_purchase
FROM purchases
GROUP BY user_id;

49. Detect schema changes in SCD Type 2 tables using Delta Lake

Explanation:

Delta Lake automatically tracks schema changes in its transaction log. You can query the history to see schema evolution or use `mergeSchema` option during writes. Detecting *unintended* changes would involve comparing the current schema to a predefined expected schema.

Conceptual PySpark/Delta Lake approach:

# To see schema history
spark.sql("DESCRIBE HISTORY delta.`/path/to/delta/table`").show()

# To compare current schema with expected (conceptual)
# current_schema = spark.read.format("delta").load("/path/to/delta/table").schema
# expected_schema = StructType([...]) # Define your expected schema
# if current_schema != expected_schema:
#     print("Schema mismatch detected!")

50. Join product and transaction tables and filter out null foreign keys safely

Explanation:

Using an `INNER JOIN` automatically filters out rows where the join key is NULL in either table, ensuring only matched records are returned.

Sample SQL Query:

SELECT t.*, p.product_name
FROM transactions t
INNER JOIN products p ON t.product_id = p.product_id;
-- This implicitly handles null foreign keys by only returning matched rows.
-- To also keep transactions that have no matching product, use a LEFT JOIN instead:
-- SELECT t.*, p.product_name FROM transactions t LEFT JOIN products p ON t.product_id = p.product_id;

51. Get users who upgraded to premium within 7 days of signup

Explanation:

Joins users and premium_subscriptions, then filters based on the date difference.

Sample SQL Query:

SELECT u.user_id, u.signup_date, ps.upgrade_date
FROM users u
JOIN premium_subscriptions ps ON u.user_id = ps.user_id
WHERE ps.upgrade_date <= DATEADD(DAY, 7, u.signup_date);

52. Calculate cumulative distinct product purchases per customer

Explanation:

Most engines (PostgreSQL, SQL Server, MySQL) do not allow DISTINCT inside a window function, so `COUNT(DISTINCT ...) OVER` is rejected. Instead, flag the first occurrence of each (customer, product) pair with ROW_NUMBER(), then take a running SUM of the flags.

Sample SQL Query:

SELECT customer_id, purchase_date, product_id,
       SUM(is_first) OVER (PARTITION BY customer_id ORDER BY purchase_date
                           ROWS UNBOUNDED PRECEDING) AS cumulative_distinct_products
FROM (
    SELECT customer_id, purchase_date, product_id,
           CASE WHEN ROW_NUMBER() OVER (PARTITION BY customer_id, product_id
                                        ORDER BY purchase_date) = 1
                THEN 1 ELSE 0 END AS is_first
    FROM purchases
) t
ORDER BY customer_id, purchase_date;
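Because most engines reject DISTINCT inside a window function, a portable approach is to flag the first purchase of each (customer, product) pair and keep a running sum of those flags. A sqlite3 sketch on made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (customer_id INT, purchase_date TEXT, product_id INT);
INSERT INTO purchases VALUES
  (1, '2024-01-01', 10), (1, '2024-01-02', 10),  -- same product bought twice
  (1, '2024-01-03', 20), (2, '2024-01-01', 30);
""")
rows = conn.execute("""
SELECT customer_id, purchase_date,
       SUM(is_first) OVER (PARTITION BY customer_id ORDER BY purchase_date
                           ROWS UNBOUNDED PRECEDING) AS cumulative_distinct_products
FROM (
    SELECT customer_id, purchase_date,
           CASE WHEN ROW_NUMBER() OVER (PARTITION BY customer_id, product_id
                                        ORDER BY purchase_date) = 1
                THEN 1 ELSE 0 END AS is_first
    FROM purchases
) t
ORDER BY customer_id, purchase_date;
""").fetchall()
print(rows)
# Customer 1's count stays at 1 for the repeat purchase, then rises to 2
```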

53. Retrieve customers who spent above average in their region

Explanation:

Uses one CTE to total each customer's spend, a second CTE to average those totals per region, then filters customers above their regional average. (Averaging raw transaction amounts would wrongly compare a per-customer total against a per-transaction average.)

Sample SQL Query:

WITH CustomerSpend AS (
    SELECT c.customer_id, c.customer_name, c.region, SUM(t.amount) AS total_spend
    FROM customers c
    JOIN transactions t ON c.customer_id = t.customer_id
    GROUP BY c.customer_id, c.customer_name, c.region
),
RegionalAvgSpend AS (
    SELECT region, AVG(total_spend) AS avg_region_spend
    FROM CustomerSpend
    GROUP BY region
)
SELECT cs.customer_id, cs.customer_name, cs.total_spend, ras.avg_region_spend
FROM CustomerSpend cs
JOIN RegionalAvgSpend ras ON cs.region = ras.region
WHERE cs.total_spend > ras.avg_region_spend;

54. Find duplicate rows in an ingestion table (based on all columns)

Explanation:

Groups by all columns and counts occurrences. Filters for counts greater than 1.

Sample SQL Query:

SELECT col1, col2, col3, COUNT(*)
FROM ingestion_table
GROUP BY col1, col2, col3 -- Include all columns that define uniqueness
HAVING COUNT(*) > 1;

55. Compute daily revenue growth % using lag window function

Explanation:

Calculates daily revenue, then uses `LAG()` to get the previous day's revenue to compute growth percentage.

Sample SQL Query:

WITH DailyRevenue AS (
    SELECT CAST(sale_date AS DATE) AS sale_day, SUM(amount) AS daily_revenue
    FROM sales
    GROUP BY CAST(sale_date AS DATE)
)
SELECT sale_day, daily_revenue,
       LAG(daily_revenue, 1, 0) OVER (ORDER BY sale_day) AS previous_day_revenue,
       (daily_revenue - LAG(daily_revenue, 1, 0) OVER (ORDER BY sale_day)) * 100.0 / NULLIF(LAG(daily_revenue, 1, 0) OVER (ORDER BY sale_day), 0) AS growth_percent
FROM DailyRevenue
ORDER BY sale_day;
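The daily growth calculation can be verified with sqlite3 on made-up daily figures (a sketch; NULLIF guards against division by zero, and the first day has no prior row):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (sale_date TEXT, amount INT);
INSERT INTO sales VALUES
  ('2024-01-01', 100), ('2024-01-02', 150), ('2024-01-03', 120);
""")
rows = conn.execute("""
WITH DailyRevenue AS (
    SELECT sale_date AS sale_day, SUM(amount) AS daily_revenue
    FROM sales GROUP BY sale_date
)
SELECT sale_day, daily_revenue,
       (daily_revenue - LAG(daily_revenue) OVER (ORDER BY sale_day)) * 100.0
       / NULLIF(LAG(daily_revenue) OVER (ORDER BY sale_day), 0) AS growth_pct
FROM DailyRevenue
ORDER BY sale_day;
""").fetchall()
print(rows)
# Day 1 has no prior day (NULL), day 2 is +50%, day 3 is -20%
```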

56. Identify products with declining sales 3 months in a row

Explanation:

Calculates monthly sales, then uses `LAG()` to compare current month's sales with the previous two months.

Sample SQL Query:

WITH MonthlySales AS (
    SELECT product_id,
           DATE_TRUNC('month', sale_date) AS sales_month,
           SUM(amount) AS monthly_sales
    FROM sales
    GROUP BY product_id, DATE_TRUNC('month', sale_date)
), LaggedSales AS (
    SELECT product_id, sales_month, monthly_sales,
           LAG(monthly_sales, 1) OVER (PARTITION BY product_id ORDER BY sales_month) AS prev_month_sales_1,
           LAG(monthly_sales, 2) OVER (PARTITION BY product_id ORDER BY sales_month) AS prev_month_sales_2
    FROM MonthlySales
)
SELECT DISTINCT product_id
FROM LaggedSales
WHERE monthly_sales < prev_month_sales_1
  AND prev_month_sales_1 < prev_month_sales_2;
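A sqlite3 sketch of the three-months-declining check, on made-up data where product 1 declines every month and product 2 does not:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (product_id INT, sale_date TEXT, amount INT);
INSERT INTO sales VALUES
  (1, '2024-01-15', 300), (1, '2024-02-15', 200), (1, '2024-03-15', 100),
  (2, '2024-01-15', 100), (2, '2024-02-15', 200), (2, '2024-03-15', 150);
""")
declining = conn.execute("""
WITH MonthlySales AS (
    SELECT product_id, strftime('%Y-%m', sale_date) AS sales_month,
           SUM(amount) AS monthly_sales
    FROM sales GROUP BY product_id, sales_month
), LaggedSales AS (
    SELECT product_id, monthly_sales,
           LAG(monthly_sales, 1) OVER (PARTITION BY product_id ORDER BY sales_month) AS prev1,
           LAG(monthly_sales, 2) OVER (PARTITION BY product_id ORDER BY sales_month) AS prev2
    FROM MonthlySales
)
SELECT DISTINCT product_id FROM LaggedSales
WHERE monthly_sales < prev1 AND prev1 < prev2;
""").fetchall()
print(declining)  # only product 1 declines for three consecutive months
```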

57. Get users with at least 3 logins per week over last 2 months

Explanation:

Groups logins by user and week, then filters for users meeting the login count criteria over the specified period.

Sample SQL Query:

WITH WeeklyLogins AS (
    SELECT user_id,
           DATE_TRUNC('week', login_timestamp) AS login_week,
           COUNT(*) AS weekly_login_count
    FROM user_logins
    WHERE login_timestamp >= DATEADD(month, -2, CURRENT_DATE)
    GROUP BY user_id, DATE_TRUNC('week', login_timestamp)
)
SELECT user_id
FROM WeeklyLogins
WHERE weekly_login_count >= 3
GROUP BY user_id
HAVING COUNT(*) >= 8; -- roughly 8 full weeks in 2 months; every counted week already has >= 3 logins

58. Rank users by frequency of login in the current quarter

Explanation:

Counts logins per user in the current quarter, then ranks them based on this count.

Sample SQL Query:

SELECT user_id, login_count,
       RANK() OVER (ORDER BY login_count DESC) AS login_rank
FROM (
    SELECT user_id, COUNT(*) AS login_count
    FROM user_logins
    WHERE login_timestamp >= DATE_TRUNC('quarter', CURRENT_DATE)
      AND login_timestamp < DATE_TRUNC('quarter', CURRENT_DATE) + INTERVAL '3 month'
    GROUP BY user_id
) AS quarterly_logins
ORDER BY login_rank;

59. Fetch users who purchased same product multiple times in one day

Explanation:

Groups purchases by user, product, and date, then filters for groups with more than one purchase.

Sample SQL Query:

SELECT user_id, product_id, CAST(purchase_date AS DATE) AS purchase_day, COUNT(*) AS purchase_count
FROM purchases
GROUP BY user_id, product_id, CAST(purchase_date AS DATE)
HAVING COUNT(*) > 1;

60. Detect and delete late-arriving data for current month partitions

Explanation:

This is a conceptual approach for a partitioned table. You'd typically have a `load_timestamp` or similar column to identify when data *arrived* vs. its `event_timestamp`. Late data is identified by `event_timestamp` being in a 'closed' partition (e.g., previous month) but `load_timestamp` is current. Deletion then targets these specific records.

Conceptual SQL Query (assuming `event_date` is partition key, `load_date` is ingestion date):

-- Identify late-arriving data for the current month's partition
SELECT *
FROM your_partitioned_table
WHERE CAST(event_timestamp AS DATE) < DATE_TRUNC('month', CURRENT_DATE) -- Data belongs to previous month
  AND CAST(load_timestamp AS DATE) >= DATE_TRUNC('month', CURRENT_DATE); -- But loaded this month

-- To delete (use with extreme caution after careful testing):
-- DELETE FROM your_partitioned_table
-- WHERE CAST(event_timestamp AS DATE) < DATE_TRUNC('month', CURRENT_DATE)
--   AND CAST(load_timestamp AS DATE) >= DATE_TRUNC('month', CURRENT_DATE);

61. Get top 5 products by profit margin across all categories

Explanation:

Calculates profit margin for each product, then ranks and selects the top 5.

Sample SQL Query:

SELECT product_id, product_name, profit_margin
FROM (
    SELECT p.product_id, p.product_name,
           (SUM(s.sales_amount) - SUM(s.cost_amount)) * 1.0 / NULLIF(SUM(s.sales_amount), 0) AS profit_margin,
           ROW_NUMBER() OVER (ORDER BY (SUM(s.sales_amount) - SUM(s.cost_amount)) * 1.0 / NULLIF(SUM(s.sales_amount), 0) DESC) AS rnk
    FROM products p
    JOIN sales s ON p.product_id = s.product_id
    GROUP BY p.product_id, p.product_name
) ranked_products
WHERE rnk <= 5
ORDER BY profit_margin DESC;

62. Compare rolling 30-day revenue vs previous 30-day window

Explanation:

Uses a CTE to get daily revenue, then a second CTE for the rolling 30-day sum. Window functions cannot be nested, so the rolling sum is materialized first and LAG() then shifts it back 30 days.

Sample SQL Query:

WITH DailyRevenue AS (
    SELECT CAST(sale_date AS DATE) AS sale_day, SUM(amount) AS daily_revenue
    FROM sales
    GROUP BY CAST(sale_date AS DATE)
),
Rolling AS (
    SELECT sale_day,
           SUM(daily_revenue) OVER (ORDER BY sale_day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS current_30day_revenue
    FROM DailyRevenue
)
SELECT sale_day, current_30day_revenue,
       LAG(current_30day_revenue, 30) OVER (ORDER BY sale_day) AS previous_30day_revenue
FROM Rolling
ORDER BY sale_day;

63. Flag transactions happening outside business hours

Explanation:

Uses `CASE` statement to check if the transaction time falls outside a defined business hour range (e.g., 9 AM to 5 PM).

Sample SQL Query:

SELECT transaction_id, transaction_timestamp,
       CASE
           WHEN CAST(transaction_timestamp AS TIME) < '09:00:00' OR CAST(transaction_timestamp AS TIME) >= '17:00:00'
           THEN 'Outside Business Hours'
           ELSE 'During Business Hours'
       END AS business_hours_flag
FROM transactions;

64. Write an optimized SQL query using broadcast join hints for small lookup tables

Explanation:

A broadcast join hint tells the optimizer to replicate the smaller table to every node, avoiding a shuffle of the larger table. This is highly efficient for joining a large fact table with a small dimension or lookup table. Syntax varies by engine (for example, /*+ BROADCAST(d) */ in Spark SQL), and many optimizers broadcast small tables automatically.

Sample SQL Query (conceptual, syntax depends on specific SQL dialect/DB):

-- Example for SQL Server (join strategy forced via the OPTION clause)
SELECT f.fact_column, d.dimension_column
FROM FactTable f
INNER JOIN DimensionTable d ON f.dim_key = d.dim_key
OPTION (HASH JOIN);

-- Example for Spark SQL (using hint)
SELECT /*+ BROADCAST(d) */ f.fact_column, d.dimension_column
FROM
    FactTable f
INNER JOIN
    DimensionTable d ON f.dim_key = d.dim_key;

-- In most modern databases, the optimizer will automatically choose a broadcast hash join
-- if one table is small enough, so explicit hints are often not needed unless
-- you want to override default behavior or for specific edge cases.

ETL & Data Warehouse Concepts

This section covers ETL optimization, data warehouse design, and CDC implementation strategies.

65. Your source has 10 million records. How will you optimize the ETL job?

Solution / Answer Ideas:

  • Use partitioning (Hash or Range) to parallelize processing
  • Minimize in-memory transformations (avoid Lookups on large data)
  • Push logic to source DB using SQL override
  • Use bulk loading or batch commits
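The bulk-loading/batch-commit point can be sketched in plain Python, using an in-memory SQLite table as a stand-in for the target system (the table name, columns, and batch size are all illustrative, not a specific tool's API):

```python
import sqlite3

def load_in_batches(rows, batch_size=10_000):
    """Load rows with batched commits instead of one commit per record."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target (id INTEGER, amount REAL)")
    # executemany + periodic commit avoids per-row round trips and commit overhead
    for start in range(0, len(rows), batch_size):
        conn.executemany("INSERT INTO target VALUES (?, ?)",
                         rows[start:start + batch_size])
        conn.commit()
    return conn

conn = load_in_batches([(i, i * 1.5) for i in range(25_000)], batch_size=10_000)
print(conn.execute("SELECT COUNT(*) FROM target").fetchone()[0])  # 25000
```

The same pattern scales to partitioned parallel loads: run one such batch loader per partition.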

66. How do you handle a scenario where some records are rejected during transformation?

Solution / Answer Ideas:

  • Redirect bad records to a reject file/table
  • Capture error message, source key, and timestamp for debugging
  • Use error handling stages (e.g., Reject links in DataStage, Error log in Informatica)
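A minimal Python sketch of the reject-routing pattern described above, with a toy validation rule and invented field names:

```python
from datetime import datetime, timezone

def transform(record):
    # toy transformation: amount must be a non-negative number
    if record.get("amount") is None or record["amount"] < 0:
        raise ValueError("invalid amount")
    return {**record, "amount": round(record["amount"], 2)}

def run_with_rejects(records):
    loaded, rejects = [], []
    for rec in records:
        try:
            loaded.append(transform(rec))
        except Exception as exc:
            # capture error message, source key, and timestamp for debugging
            rejects.append({"source_key": rec.get("id"),
                            "error": str(exc),
                            "rejected_at": datetime.now(timezone.utc).isoformat()})
    return loaded, rejects

good, bad = run_with_rejects([{"id": 1, "amount": 10.0},
                              {"id": 2, "amount": -5},
                              {"id": 3, "amount": None}])
print(len(good), len(bad))  # 1 2
```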

67. You need to pass different file paths for DEV, QA, and PROD. How would you do this?

Solution / Answer Ideas:

  • Use parameter files or environment variables
  • DataStage: ParamSet or DSParam file
  • Informatica: Parameter file with $$SourceFilePath

68. How do you implement incremental load (CDC)?

Solution / Answer Ideas:

  • Use Last Updated Timestamp or Surrogate Key
  • Store last load timestamp in a control table
  • Filter source using WHERE last_update > :last_loaded_time
  • Informatica: CDC mappings or Change Data Capture tools
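The control-table watermark approach can be sketched with SQLite standing in for both the source and the control table (the job name, table names, and columns are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (id INTEGER, last_update TEXT);
    CREATE TABLE control (job TEXT PRIMARY KEY, last_loaded_time TEXT);
    INSERT INTO source VALUES (1, '2024-01-01'), (2, '2024-02-01'), (3, '2024-03-01');
    INSERT INTO control VALUES ('daily_load', '2024-01-15');
""")

def incremental_extract(conn, job):
    # read the high-watermark from the control table ...
    (wm,) = conn.execute(
        "SELECT last_loaded_time FROM control WHERE job = ?", (job,)).fetchone()
    # ... pull only records changed since then ...
    rows = conn.execute(
        "SELECT id, last_update FROM source WHERE last_update > ? ORDER BY last_update",
        (wm,)).fetchall()
    # ... and advance the watermark for the next run
    if rows:
        conn.execute("UPDATE control SET last_loaded_time = ? WHERE job = ?",
                     (max(r[1] for r in rows), job))
    return rows

print(incremental_extract(conn, "daily_load"))  # [(2, '2024-02-01'), (3, '2024-03-01')]
print(incremental_extract(conn, "daily_load"))  # [] -- nothing new since last run
```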

69. How would you load the most recent record per customer from a transaction table?

Solution / Answer Ideas:

  • Sort by Customer ID and Date in descending order
  • Use row_number() or stage logic to pick only first record
  • Informatica: Use Sorter + Expression to flag first row
  • DataStage: Use Remove Duplicates or Transformer with stage variables
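A plain-Python sketch of the "latest record per key" logic (field names are illustrative); the same idea maps to row_number() in SQL or Sorter + Expression in an ETL tool:

```python
transactions = [
    {"customer_id": "C1", "txn_date": "2024-01-05", "amount": 100},
    {"customer_id": "C2", "txn_date": "2024-01-07", "amount": 250},
    {"customer_id": "C1", "txn_date": "2024-02-10", "amount": 300},
]

def latest_per_customer(rows):
    # sort ascending by date, then keep the last row seen per customer
    latest = {}
    for row in sorted(rows, key=lambda r: r["txn_date"]):
        latest[row["customer_id"]] = row
    return latest

print(latest_per_customer(transactions)["C1"]["amount"])  # 300
```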

70. How do you implement Slowly Changing Dimensions (SCD) in a data warehouse?

Explanation:

SCD Type 1: Overwrite the old value (no history kept).

SCD Type 2: Add a new row per change, with versioning/effective-date timestamps and a current-row flag.

SCD Type 3: Add a column holding the previous value. Use ETL tools or SQL MERGE/upsert logic to manage SCDs.
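As a rough illustration of Type 2 logic (a sketch, not any particular tool's implementation), here is a Python function that expires the current row and appends a new version; the field names are invented:

```python
from datetime import date

def scd2_upsert(dimension, key, new_attrs, as_of):
    """Close the current row for `key` (if changed) and append a new version."""
    current = next((r for r in dimension
                    if r["key"] == key and r["is_current"]), None)
    if current and current["attrs"] == new_attrs:
        return dimension  # no change, nothing to do
    if current:
        current["is_current"] = False
        current["end_date"] = as_of          # expire the old version
    dimension.append({"key": key, "attrs": new_attrs,
                      "start_date": as_of, "end_date": None,
                      "is_current": True})   # insert the new version
    return dimension

dim = []
scd2_upsert(dim, "CUST-1", {"city": "Mumbai"}, date(2024, 1, 1))
scd2_upsert(dim, "CUST-1", {"city": "Delhi"}, date(2024, 6, 1))
print(len(dim), dim[-1]["attrs"]["city"])  # 2 Delhi
```

In a warehouse, the same logic is typically expressed as a MERGE statement keyed on the natural key plus the current-row flag.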

71. Explain the concept of star schema and snowflake schema in data modeling

Explanation:

Star schema: Central fact table linked to denormalized dimension tables; simple, fast queries.

Snowflake schema: Dimensions are normalized into multiple related tables; reduces redundancy but more complex joins.

72. How would you design a fact table for an e-commerce platform?

Explanation:

Include transaction-level facts (e.g., sales amount, quantity), foreign keys to dimensions (date, product, customer), and measures (discount, tax). Ensure granularity matches business needs (e.g., order line item).

Spark & Big Data Concepts

This section covers Apache Spark optimization, distributed computing, and big data processing concepts.

73. Explain the difference between repartition and coalesce in Spark

Explanation:

Repartition: Increases/decreases partitions, shuffles data.

Coalesce: Decreases partitions, minimizes shuffle.

74. Explain the differences between caching and persisting in Spark

Explanation:

Cache: Shorthand for persist() with the default storage level (in-memory).

Persist: Stores data in memory and/or on disk, with configurable storage levels (e.g., MEMORY_ONLY, MEMORY_AND_DISK).

75. What is a distributed table, and what are its types?

Explanation:

Table split across nodes for parallel processing.

Types: Sharded, Replicated, Partitioned.

76. Explain Spark's optimization techniques

Explanation:

Catalyst optimizer, Tungsten execution, predicate pushdown, broadcast joins, partition pruning, caching.

77. Explain the difference between wide and narrow transformations in Spark

Explanation:

Narrow: Each output partition depends on a single input partition; no shuffle needed (e.g., map, filter).

Wide: Output partitions depend on data from many input partitions, requiring a shuffle (e.g., groupBy, join).

78. Define Autoscaling and Auto-Termination

Explanation:

Autoscaling: Dynamically adjusts cluster resources.

Auto-Termination: Shuts down cluster after inactivity.

79. Explain the differences between RDD and DataFrame

Explanation:

RDD: Low-level, type-safe, no schema, less optimized.

DataFrame: High-level, schema, optimized via Catalyst.

80. Explain the Parquet file format and its advantages

Explanation:

Columnar storage, compression, schema evolution, efficient for analytics, supports predicate pushdown.

81. Define Delta Lake and explain its versioning features

Explanation:

Delta Lake: ACID transactions on data lakes, supports time travel/versioning, schema enforcement, and rollback.

82. How do you perform schema evolution in Delta Lake?

Explanation:

Use Delta Lake's mergeSchema option to allow new columns. E.g., df.write.option('mergeSchema', 'true').format('delta').mode('append').save(path). Delta automatically tracks schema changes.

83. How would you design a data pipeline to handle both batch and streaming data?

Explanation:

Use a unified architecture: ingest batch data via scheduled jobs and streaming data via Spark Structured Streaming. Store both in a common data store (e.g., Delta Lake) and process with the same transformation logic.

Python & PySpark Coding Challenges

This section covers practical Python and PySpark coding scenarios commonly asked in interviews.

84. Write a Python script to read a CSV file and load it into a DataFrame

Python Solution:

import pandas as pd
df = pd.read_csv('filename.csv')
print(df.head())

85. How do you handle exceptions in Python using try-except blocks?

Python Solution:

try:
    # risky code
    result = 10 / 0
except ZeroDivisionError as e:
    print(f'Error: {e}')
finally:
    print('Cleanup or final steps')

86. In PySpark, how would you perform a join operation between two large DataFrames efficiently?

PySpark Solution / Explanation:

Use broadcast() for small DataFrames to avoid shuffles:

from pyspark.sql.functions import broadcast
df_large.join(broadcast(df_small), 'key')
# For large-large joins, ensure both are partitioned on the join key.

87. Write a PySpark code to find the top 3 customers with the highest revenue per region

PySpark Solution:

from pyspark.sql.window import Window
import pyspark.sql.functions as F
window = Window.partitionBy('region').orderBy(F.desc('revenue'))
df.withColumn('rank', F.row_number().over(window)) \
  .filter('rank <= 3') \
  .select('customer_id', 'region', 'revenue')

88. What is the difference between partitioning and bucketing in PySpark?

Explanation:

Partitioning: Splits data into directories based on column values, improving query performance for partitioned columns.

Bucketing: Divides data into fixed buckets using a hash function, enabling efficient joins and sampling even if the bucket column isn't in the filter.

89. Write a PySpark code to process streaming data from Event Hub in Databricks

PySpark Solution:

from pyspark.sql.types import StringType
event_hub_conf = { 'eventhubs.connectionString': '' }
df = (spark.readStream.format('eventhubs')
      .options(**event_hub_conf)
      .load())
df.writeStream.format('console').start().awaitTermination()

90. Write a PySpark script to load data from ADLS into a Delta table

PySpark Solution:

df = spark.read.format('csv').option('header', 'true').load('abfss://container@account.dfs.core.windows.net/path')
df.write.format('delta').mode('overwrite').save('/mnt/delta/table')

91. Write a Python script to validate data quality and detect anomalies

Python Solution:

import pandas as pd
df = pd.read_csv('data.csv')
print(df.isnull().sum())  # missing values
print(df.describe())  # summary stats
anomalies = df[df['value'] > df['value'].mean() + 3*df['value'].std()]
print(anomalies)

92. Write a PySpark code to perform window functions for ranking sales data

PySpark Solution:

from pyspark.sql.window import Window
import pyspark.sql.functions as F
window = Window.partitionBy('region').orderBy(F.desc('sales'))
df.withColumn('rank', F.rank().over(window)).show()

93. Explain the differences between lists, tuples, sets, and dictionaries in Python

Detailed Interview Response:

  • Lists: Ordered, mutable collections (e.g., [1,2,3]), used for sequences.
  • Tuples: Ordered, immutable (e.g., (1,2,3)), used for fixed data.
  • Sets: Unordered, unique items (e.g., {1,2,3}), used for membership tests.
  • Dictionaries: Key-value pairs (e.g., {'a':1}), used for fast lookups and mapping relationships.
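A quick demonstration of all four structures:

```python
nums_list = [1, 2, 2, 3]          # ordered, mutable, allows duplicates
nums_list.append(4)

point = (10, 20)                  # ordered, immutable -- usable as a dict key
distances = {point: 22.36}

unique = set(nums_list)           # unordered, duplicates removed
print(2 in unique)                # True -- fast O(1) membership test

ages = {"alice": 30, "bob": 25}   # key-value mapping, fast lookups
print(ages["alice"])              # 30
```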

94. How would you use pandas to merge two datasets and calculate total sales for products with valid promotions?

Python Solution:

import pandas as pd
sales = pd.read_csv('sales.csv')
promos = pd.read_csv('promotions.csv')
merged = pd.merge(sales, promos, on='product_id')
valid = merged[merged['promotion_active'] == True]
total_sales = valid.groupby('product_id')['sales_amount'].sum()
print(total_sales)

Python Interview Questions

Core Python concepts and interview questions covering data structures, memory management, and OOP principles.

95. What are the differences between a list and a tuple in Python?

Sample Response:

Lists are mutable, meaning their elements can be changed, while tuples are immutable. Lists use square brackets [ ], tuples use parentheses ( ). Tuples are generally faster and can be used as dictionary keys if they contain only immutable elements.

96. How does Python handle memory management and garbage collection?

Sample Response:

Python uses automatic memory management with a built-in garbage collector that reclaims memory by reference counting and cyclic garbage collection. The gc module can be used to interact with the garbage collector.

97. Can you explain the concept of list comprehensions? Provide an example

Sample Response:

List comprehensions provide a concise way to create lists. Example: [x*x for x in range(5)] creates [0, 1, 4, 9, 16].

98. What is the difference between deep copy and shallow copy in Python?

Sample Response:

A shallow copy creates a new object but does not create copies of nested objects; changes to nested objects affect both copies. A deep copy creates a new object and recursively copies all nested objects, so changes do not affect the original.
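A short demonstration using the copy module:

```python
import copy

original = [[1, 2], [3, 4]]

shallow = copy.copy(original)      # new outer list, same inner lists
deep = copy.deepcopy(original)     # inner lists recursively copied too

original[0].append(99)
print(shallow[0])  # [1, 2, 99] -- the shared inner list was mutated
print(deep[0])     # [1, 2]     -- unaffected
```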

99. How do you handle exceptions in Python? Can you give an example of custom exception handling?

Sample Response:

Use try, except blocks. Custom exceptions are created by subclassing Exception. Example: class MyError(Exception): pass and then try: ... except MyError: ...
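A slightly fuller sketch with a custom exception (the exception name and validation rule are illustrative):

```python
class DataQualityError(Exception):
    """Raised when a record fails validation (illustrative name)."""

def validate(record):
    if record.get("amount", 0) < 0:
        raise DataQualityError(f"negative amount for id={record.get('id')}")
    return record

try:
    validate({"id": 7, "amount": -1})
except DataQualityError as exc:
    message = str(exc)

print(message)  # negative amount for id=7
```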

100. What are *args and **kwargs in Python functions? When would you use them?

Sample Response:

*args allows a function to accept any number of positional arguments, while **kwargs allows for any number of keyword arguments. Useful when you don't know beforehand how many arguments will be passed.
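A brief example of both, including the mirror-image use of * and ** to unpack arguments at the call site:

```python
def log_call(*args, **kwargs):
    # args is a tuple of positional arguments, kwargs a dict of keyword ones
    return f"args={args}, kwargs={kwargs}"

print(log_call(1, 2, retries=3))  # args=(1, 2), kwargs={'retries': 3}

def add(a, b, c):
    return a + b + c

params = (1, 2)
options = {"c": 3}
print(add(*params, **options))    # 6 -- unpacking works in calls too
```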

101. Explain the difference between a class method, static method, and instance method

Sample Response:

Instance methods operate on the object instance and can access/modify object state. Class methods use @classmethod and take cls as the first argument; they can access/modify class state. Static methods use @staticmethod and don't access class or instance state.
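A compact illustration of all three method types:

```python
class Counter:
    count = 0  # class-level state

    def __init__(self, start):
        self.value = start          # instance state

    def increment(self):            # instance method: uses self
        self.value += 1
        return self.value

    @classmethod
    def from_zero(cls):             # class method: alternative constructor
        cls.count += 1
        return cls(0)

    @staticmethod
    def is_valid(n):                # static method: no self/cls access
        return isinstance(n, int)

c = Counter.from_zero()
print(c.increment(), Counter.count, Counter.is_valid(5))  # 1 1 True
```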

102. What is inheritance in Python? Can you provide a simple example?

Sample Response:

Inheritance allows a class to inherit attributes and methods from another class. Example: class Animal: ... then class Dog(Animal): ...
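For example:

```python
class Animal:
    def __init__(self, name):
        self.name = name

    def speak(self):
        return f"{self.name} makes a sound"

class Dog(Animal):               # Dog inherits __init__ from Animal
    def speak(self):             # and overrides speak()
        return f"{self.name} barks"

print(Dog("Rex").speak())              # Rex barks
print(isinstance(Dog("Rex"), Animal))  # True
```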

103. How does Python's method resolution order (MRO) work in multiple inheritance?

Sample Response:

MRO determines the order in which base classes are searched when executing a method. Python uses the C3 linearization algorithm, accessible via ClassName.__mro__.
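A small diamond-inheritance example showing the linearized order:

```python
class A:
    def who(self):
        return "A"

class B(A):
    def who(self):
        return "B"

class C(A):
    def who(self):
        return "C"

class D(B, C):   # diamond inheritance: C3 linearization gives D -> B -> C -> A
    pass

print([cls.__name__ for cls in D.__mro__])  # ['D', 'B', 'C', 'A', 'object']
print(D().who())                            # B -- first match in MRO order
```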

104. How do you read and write data from/to a CSV file using Python?

Sample Response:

Use the csv module or Pandas. Example: import csv; with open('file.csv') as f: reader = csv.reader(f) for reading. With Pandas: pd.read_csv('file.csv').

105. What are some common methods provided by the Pandas DataFrame object?

Sample Response:

Common methods include head(), tail(), info(), describe(), groupby(), merge(), drop(), fillna(), and apply().

106. How would you handle missing data in a Pandas DataFrame?

Sample Response:

Use methods like dropna() to remove missing values or fillna() to replace them with a specific value or method (e.g., mean, median).

107. Can you explain how to merge or join two DataFrames in Pandas?

Sample Response:

Use the merge() function for SQL-style joins or concat() for stacking DataFrames vertically or horizontally. Example: pd.merge(df1, df2, on='key').

108. How do you schedule a Python script to run automatically (e.g., daily)?

Sample Response:

On Linux, use cron jobs; on Windows, use Task Scheduler. Alternatively, use Python libraries like schedule or APScheduler for in-app scheduling.

109. What is the use of the __init__.py file in a Python package?

Sample Response:

__init__.py marks a directory as a Python package and can be used to execute package initialization code or set the __all__ variable.

110. How can you parse JSON data in Python?

Sample Response:

Use the json module: import json; data = json.loads(json_string) to parse, and json.dumps(obj) to serialize.

111. What is a generator in Python? How is it different from a normal function?

Sample Response:

Generators use yield to return values one at a time, maintaining state between calls, which makes them memory efficient for large datasets. Normal functions return all values at once with return.
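For example, a generator that yields rows in fixed-size batches, a common pattern when streaming large extracts:

```python
def read_batches(rows, batch_size):
    """Yield rows in fixed-size batches without materializing them all."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

gen = read_batches(range(7), 3)
print(next(gen))   # [0, 1, 2] -- state is kept between calls
print(list(gen))   # [[3, 4, 5], [6]]
```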

112. How do you use decorators in Python? Can you provide an example?

Sample Response:

Decorators are functions that modify the behavior of other functions. Example: @my_decorator above a function definition.
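A concrete example (a timing decorator; the helper name is illustrative):

```python
import functools
import time

def timed(func):
    """Decorator that records how long the wrapped function takes."""
    @functools.wraps(func)  # preserves __name__, __doc__, etc.
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.last_elapsed = time.perf_counter() - start
        return result
    return wrapper

@timed
def slow_square(n):
    time.sleep(0.01)
    return n * n

print(slow_square(4))                # 16
print(slow_square.last_elapsed > 0)  # True
print(slow_square.__name__)          # slow_square
```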

113. Explain the Global Interpreter Lock (GIL) in Python and its implications for multithreading

Sample Response:

The GIL allows only one thread to execute Python bytecode at a time, which can limit CPU-bound multithreaded programs. For I/O-bound tasks, threading is still useful. For CPU-bound tasks, multiprocessing is recommended.

114. How would you optimize a Python script that is running slowly due to processing a large dataset?

Sample Response:

Use efficient data structures, vectorized operations with NumPy/Pandas, avoid loops where possible, use generators, and profile code to identify bottlenecks.

115. How do you add an element to a Python list?

Exact Answer:

Use the append() method. Example: my_list.append(value)

116. How do you add an element to a Python set?

Exact Answer:

Use the add() method. Example: my_set.add(value)

117. What is the difference between a Python list and a tuple?

Exact Answer:

Lists are mutable (can be changed), tuples are immutable (cannot be changed). Lists use [ ], tuples use ( ).

118. How do you create a virtual environment in Python?

Exact Answer:

Run python -m venv env and activate it with source env/bin/activate (Linux/Mac) or env\Scripts\activate (Windows).

119. What is the difference between REST and SOAP APIs?

Exact Answer:

REST is an architectural style using HTTP and is stateless, typically uses JSON, and is simpler. SOAP is a protocol, uses XML, is more rigid, and supports advanced features like security and transactions.

120. What is the purpose of the __init__ method in Python classes?

Exact Answer:

__init__ is the constructor method called when a new object is created from a class. It initializes the object's attributes.

Advanced SQL Scenario Queries

Complex multi-table scenarios and advanced SQL techniques.

168. Scenario: You have three tables: "Employees," "Departments," and "Salaries." The "Employees" table has the following columns: EmployeeID, EmployeeName, DepartmentID. The "Departments" table has the following columns: DepartmentID, DepartmentName. The "Salaries" table has the following columns: EmployeeID, Salary, EffectiveDate. Write a SQL query to retrieve the employee who has had the highest salary increase within the last year, along with their name, department, and the percentage increase.

Explanation / Use Case:

Uses LAG to find salary changes, calculates percentage increase, returns employee with highest increase.

Sample SQL Query:

WITH SalaryChanges AS (
    SELECT s.EmployeeID, s.Salary, s.EffectiveDate,
           LAG(s.Salary) OVER (PARTITION BY s.EmployeeID ORDER BY s.EffectiveDate) AS PrevSalary
    FROM Salaries s
    WHERE s.EffectiveDate >= DATEADD(YEAR, -1, GETDATE())
),
Increases AS (
    SELECT EmployeeID, (Salary - PrevSalary) AS SalaryIncrease, PrevSalary, Salary
    FROM SalaryChanges
    WHERE PrevSalary IS NOT NULL AND PrevSalary > 0
)
SELECT TOP 1 e.EmployeeName, d.DepartmentName,
       SalaryIncrease * 100.0 / PrevSalary AS PercentageIncrease
FROM Increases i
JOIN Employees e ON i.EmployeeID = e.EmployeeID
JOIN Departments d ON e.DepartmentID = d.DepartmentID
ORDER BY PercentageIncrease DESC;

169. Scenario: You have a table named "Logs" with the following columns: LogID, LogTime, UserID. Write a SQL query to find the top 5 users who have logged in the most consecutive days, along with the number of consecutive days.

Explanation / Use Case:

Finds login streaks using date arithmetic, counts streaks, returns top 5 users.

Sample SQL Query:

WITH UserLogs AS (
    SELECT UserID, CAST(LogTime AS DATE) AS LogDate
    FROM Logs
    GROUP BY UserID, CAST(LogTime AS DATE)
),
Streaks AS (
    SELECT UserID, LogDate,
           -- consecutive dates share the same (date - row_number) group key
           DATEADD(DAY, -ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY LogDate), LogDate) AS grp
    FROM UserLogs
),
ConsecutiveCounts AS (
    SELECT UserID, COUNT(*) AS ConsecutiveDays
    FROM Streaks
    GROUP BY UserID, grp
)
SELECT TOP 5 UserID, MAX(ConsecutiveDays) AS MaxConsecutiveDays
FROM ConsecutiveCounts
GROUP BY UserID
ORDER BY MaxConsecutiveDays DESC;

170. Scenario: Calculate the average transaction amount for each user, including users who have no transactions, and return the result as zero for those users.

Explanation / Use Case:

LEFT JOIN ensures all users are included; COALESCE returns zero for users with no transactions.

Sample SQL Query:

SELECT u.UserID, COALESCE(AVG(t.Amount), 0) AS AvgTransaction
FROM Users u
LEFT JOIN Transactions t ON u.UserID = t.UserID
GROUP BY u.UserID;

171. Scenario: You have two tables: "Customers" and "Purchases." The "Customers" table has the following columns: CustomerID, CustomerName. The "Purchases" table has the following columns: PurchaseID, PurchaseDate, CustomerID, ProductID. Write a SQL query to find the customers who have purchased all products.

Explanation / Use Case:

Compares count of products purchased by customer to total products; returns those who bought all.

Sample SQL Query:

SELECT c.CustomerID, c.CustomerName
FROM Customers c
JOIN Purchases p ON c.CustomerID = p.CustomerID
GROUP BY c.CustomerID, c.CustomerName
HAVING COUNT(DISTINCT p.ProductID) = (SELECT COUNT(DISTINCT ProductID) FROM Purchases);

SQL Date/Time Calculations

Essential date and time manipulation queries for business logic.

172. Number of days between two dates, excluding weekends

Explanation / Notes:

Generates each date between the two dates using a numbers table and counts only the weekdays. Replace @start_date and @end_date with your date columns or variables.

Corrected SQL Code:

SELECT SUM(CASE WHEN DATENAME(WEEKDAY, d) NOT IN ('Saturday', 'Sunday') THEN 1 ELSE 0 END) AS weekday_count
FROM (
    SELECT DATEADD(DAY, n, @start_date) AS d
    FROM (SELECT TOP (DATEDIFF(DAY, @start_date, @end_date) + 1)
                 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
          FROM master..spt_values) t
) days;

173. First day of the previous month

Explanation / Notes:

Finds the first day of the previous month by truncating to month and subtracting one.

Corrected SQL Code:

SELECT DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()) - 1, 0) AS first_day_prev_month;

174. Last day of the next month

Explanation / Notes:

Moves to the first day of the month after next, then subtracts one day.

Corrected SQL Code:

SELECT DATEADD(DAY, -1, DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()) + 2, 0)) AS last_day_next_month;

175. Hours between two timestamps, considering only business hours (9am-5pm)

Explanation / Notes:

Adjusts timestamps to business hours if outside 9am-5pm. Assumes both timestamps are on the same day. For multi-day spans, logic must be extended.

Corrected SQL Code:

SELECT
    CASE WHEN CAST(t1.[timestamp] AS TIME) < '09:00' THEN '09:00'
         ELSE CAST(t1.[timestamp] AS TIME) END AS clamped_start,
    CASE WHEN CAST(t2.[timestamp] AS TIME) > '17:00' THEN '17:00'
         ELSE CAST(t2.[timestamp] AS TIME) END AS clamped_end,
    DATEDIFF(HOUR,
        CASE WHEN CAST(t1.[timestamp] AS TIME) < '09:00'
             THEN DATEADD(HOUR, 9, CAST(CAST(t1.[timestamp] AS DATE) AS DATETIME))
             ELSE t1.[timestamp] END,
        CASE WHEN CAST(t2.[timestamp] AS TIME) > '17:00'
             THEN DATEADD(HOUR, 17, CAST(CAST(t2.[timestamp] AS DATE) AS DATETIME))
             ELSE t2.[timestamp] END
    ) AS business_hours_diff
FROM table1 t1
JOIN table2 t2 ON (join_condition);

176. Date of next occurrence of a specific weekday (e.g., next Wednesday)

Explanation / Notes:

Calculates days to add to reach the next desired weekday. Adjust @target_weekday as needed (1=Sunday, 2=Monday, ..., 7=Saturday).

Corrected SQL Code:

DECLARE @target_weekday INT = 4; -- 1=Sunday, 4=Wednesday (with the default DATEFIRST 7)
SELECT DATEADD(DAY, ((@target_weekday - DATEPART(WEEKDAY, GETDATE()) + 6) % 7) + 1, GETDATE()) AS next_weekday;

Advanced Python & Ecosystem Concepts

Advanced topics covering frameworks, APIs, data processing, database optimization, and DevOps practices.

121. Python and Frameworks: Can you explain the differences between Django, Flask, and FastAPI? When would you choose one over the others?

Expected Answer:

Django is a full-stack web framework offering a lot of built-in features, suitable for larger applications requiring an ORM and admin interface. Flask is a micro-framework, ideal for smaller applications or services where flexibility is needed. FastAPI is designed for building APIs quickly with automatic generation of OpenAPI documentation, and is particularly useful for asynchronous applications.

122. RESTful API Development: Describe the process you follow to design and implement a RESTful API. What are some best practices you adhere to?

Expected Answer:

The process involves defining the API endpoints and their methods (GET, POST, PUT, DELETE), ensuring data validation and error handling, and implementing authentication and authorization. Best practices include using consistent naming conventions, versioning APIs, and providing comprehensive documentation.

123. Data Processing with Pandas and NumPy: How have you used Pandas and NumPy for data transformation in your projects? Can you provide a specific example?

Expected Answer:

Pandas is used for data manipulation and analysis, providing data structures like DataFrames for handling complex datasets. NumPy is used for numerical computations, offering efficient array operations. An example might be cleaning and transforming data from CSV files using Pandas, and performing mathematical operations using NumPy.

124. Database Optimization: What strategies do you use to optimize database queries and interactions in MySQL or PostgreSQL?

Expected Answer:

Strategies include indexing columns that are frequently queried, optimizing query structure, using joins efficiently, and caching results where possible. Regularly analyzing query performance and adjusting based on database usage patterns is also crucial.

125. Deployment and DevOps: How do you approach deploying applications on Linux servers using Docker and Jenkins?

Expected Answer:

Deployment involves containerizing applications with Docker to ensure consistency across environments, and using Jenkins for continuous integration and deployment. This process typically includes setting up automated build and test pipelines, and managing deployments on Linux servers.

126. OCR and Document Processing: Can you explain how you integrated Tesseract OCR and OpenCV in your OCR-based document processing project? What challenges did you face?

Expected Answer:

Tesseract OCR is used for text extraction from images and PDFs, while OpenCV is used for pre-processing images to improve OCR accuracy. Challenges may include handling varied document formats and ensuring high accuracy in text extraction, which can be addressed through image enhancement techniques and fine-tuning OCR settings.

Azure Data Platform Questions

Comprehensive questions covering Azure Data Factory, Databricks, Synapse Analytics, and Azure ecosystem.

127. How do you build an ETL pipeline using Azure Data Factory?

Explanation:

Define linked services (data sources), create datasets, build pipelines with activities (Copy, Data Flow, etc.), use triggers for scheduling, and monitor pipeline runs. Parameterize for reusability and handle errors with activities like Web or Stored Procedure.

128. What are the different types of triggers in ADF and when to use them?

Explanation:

Schedule Trigger: Runs at specific times.

Tumbling Window: Fixed-size, non-overlapping intervals for batch processing.

Event-based: Responds to events like file arrival. Use based on data arrival pattern and business requirements.

129. Explain the architecture of Azure Databricks and its integration with Delta Lake

Explanation:

Azure Databricks is a managed Spark platform with collaborative notebooks. It integrates with Delta Lake for ACID transactions, schema enforcement, and time travel on data lakes. Data is stored in Azure Data Lake Storage and processed via Databricks clusters.

130. How do you optimize query performance in Azure Synapse Analytics?

Explanation:

Use result set caching, materialized views, partitioned tables, proper distribution (hash/round robin/replicate), minimize data movement, and leverage indexes. Monitor with Query Performance Insight.

131. How would you design a data warehouse for a retail business using Synapse?

Explanation:

Use star/snowflake schema for sales, inventory, customer, and product data. Store raw data in Data Lake, use Synapse pipelines for ETL, and create dedicated SQL pools for analytics. Partition and distribute tables for performance.

132. What are the best practices for securing data in Azure Data Lake Storage?

Explanation:

Use RBAC and ACLs for granular access, enable encryption at rest and in transit, use private endpoints, monitor with Azure Monitor, and enable firewall rules.

133. How do you manage access control and secrets using Azure Key Vault?

Explanation:

Store secrets, keys, and certificates in Key Vault. Grant access via Azure AD, use managed identities for services, and reference secrets directly in ADF, Databricks, or Synapse configurations.

134. How do you implement data lineage and governance in Microsoft Purview?

Explanation:

Register data sources, scan assets, and use Purview to track lineage from ingestion to consumption. Define business glossary, classify sensitive data, and monitor data access and movement.

135. Build a real-time analytics pipeline using Event Hub, Stream Analytics, and Synapse

Explanation:

Ingest streaming data with Event Hub, process/aggregate in Azure Stream Analytics, and output to Synapse Analytics for real-time dashboards and reporting.

136. How would you handle late-arriving data in a batch ETL pipeline?

Explanation:

Use watermarking and windowing to process late data, design ETL to allow reprocessing, and maintain audit logs for late arrivals. In ADF, use tumbling window triggers with late arrival tolerance.

137. How do you implement incremental data loading in ADF pipelines?

Explanation:

Use watermark columns (e.g., last modified date), store high-watermark values, and filter source data for new/changed records. Use ADF's built-in incremental copy or custom logic with parameters.

138. How do you optimize storage and query performance in a Synapse dedicated pool?

Explanation:

Choose appropriate distribution (hash/round robin/replicate), partition large tables, use materialized views, minimize data movement, compress data, and regularly update statistics.

139. Build an end-to-end data pipeline that ingests data from multiple sources, transforms it in Databricks, and loads it into Synapse for reporting

Explanation:

Ingest data from sources (ADLS, SQL, APIs) using ADF, process and transform in Databricks (cleaning, enrichment, aggregation), write results to Synapse dedicated pool, and build Power BI dashboards for reporting. Use orchestration and monitoring for reliability.

Power BI Interview Questions

Key Power BI concepts covering data modes, security, and visualization techniques.

140. Explain the difference between Import and Direct Query modes. Which would you choose for large datasets?

Detailed Interview Response:

Import: Loads data into Power BI for fast, in-memory analysis but is static and needs refreshing.

Direct Query: Leaves data in the source, enabling real-time queries but may be slower and limited by source performance.

For very large datasets, use Direct Query to avoid memory constraints, but optimize source performance.

141. What are slicers, and how do they differ from visual-level filters? Discuss their impact on data in a Power BI dashboard

Detailed Interview Response:

Slicers: Are visual controls that let users filter data interactively across multiple visuals.

Visual-level filters: Apply only to a single visual.

Slicers provide a dashboard-wide filtering experience, improving interactivity and user-driven analysis.

142. How do you implement Row-Level Security (RLS) in Power BI? Explain how you would restrict data access to specific users or groups

Detailed Interview Response:

Define roles and DAX filters in Power BI Desktop (e.g., [Country] = USERPRINCIPALNAME()), Publish to Power BI Service, assign users/groups to roles. RLS ensures users see only data relevant to them, enforcing data privacy.

143. What is a paginated report, and when would you use it?

Detailed Interview Response:

Paginated reports are pixel-perfect, printable reports ideal for multi-page outputs like invoices or billing statements. They allow precise control over layout and are used when detailed, printable documents are required.

Guesstimate Questions

Analytical and estimation questions to test problem-solving and structured thinking.

144. Estimate the total number of hotel bookings made globally in a day on Booking.com. Explain the factors and assumptions you would consider to arrive at your estimate

Detailed Interview Response:

Break down by regions, estimate active users, average bookings per user, and seasonality. Use public data (market share, travel trends), adjust for weekends/holidays, and validate with industry reports. Show step-by-step logic and state assumptions clearly.

145. How many unique users do you think search for flights on Booking.com in a month? Provide a structured approach to your calculation

Detailed Interview Response:

Estimate total site traffic, % interested in flights, average searches per user, and repeat users. Use funnel analysis: total visitors → % flight searchers → unique users. State assumptions and adjust for mobile/web split.

Case Study Questions

Real-world scenario-based questions to assess analytical and problem-solving skills.

146. You notice a sudden drop in conversion rates (from search to booking) for hotels in a particular city. How would you investigate the root cause and propose solutions?

Detailed Interview Response:

Analyze funnel data (searches, clicks, bookings), segment by device, channel, and user type. Check for UI changes, pricing issues, inventory problems, or external events. Propose A/B tests, user surveys, and cross-team collaboration to address findings.

147. Booking.com is launching a new feature that allows users to book multi-city trips. How would you measure its success post-launch, and what metrics would you track to ensure its adoption and profitability?

Detailed Interview Response:

Track adoption rate (feature usage), conversion rate, average booking value, customer retention, and feedback. Compare with single-city bookings, monitor technical performance, and analyze user cohorts for repeat usage and upsell/cross-sell impact.

General Data Engineering & Project Experience

Questions about your personal project experience and contributions.

148. Can you elaborate on your current project and the specific contributions you make to it?

Sample Answer:

Briefly describe the project domain, tech stack, business goal, and your roles (e.g., ETL design, data modeling, performance tuning, automation, etc.).

Advanced SQL Scenarios & Functions

Advanced SQL queries covering window functions, aggregations, and complex scenarios.

149. Write a query to find the second-highest salary in a department. You might use ROW_NUMBER() or DENSE_RANK() to achieve this.

Detailed Interview Response / Example:

SELECT department_id, salary
FROM (
    SELECT
        department_id,
        salary,
        ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rn
    FROM employees
) ranked
WHERE rn = 2;
-- DENSE_RANK() can be used if you want to handle ties.
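A minimal runnable sketch of the ROW_NUMBER() vs DENSE_RANK() difference, using SQLite via Python (window functions need SQLite 3.25+). The table and salaries are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (department_id INT, salary INT);
    INSERT INTO employees VALUES
        (1, 90000), (1, 90000), (1, 80000),
        (2, 70000), (2, 60000);
""")

# ROW_NUMBER(): tied salaries get distinct ranks, so one tied 90000 row is rn = 2.
row_number_result = conn.execute("""
    SELECT department_id, salary FROM (
        SELECT department_id, salary,
               ROW_NUMBER() OVER (PARTITION BY department_id
                                  ORDER BY salary DESC) AS rn
        FROM employees
    ) ranked WHERE rn = 2
""").fetchall()

# DENSE_RANK(): tied salaries share rank 1, so rank 2 is the next distinct salary.
dense_rank_result = conn.execute("""
    SELECT department_id, salary FROM (
        SELECT department_id, salary,
               DENSE_RANK() OVER (PARTITION BY department_id
                                  ORDER BY salary DESC) AS rnk
        FROM employees
    ) ranked WHERE rnk = 2
""").fetchall()

print(row_number_result)  # dept 1 yields 90000 (the tied duplicate)
print(dense_rank_result)  # dept 1 yields 80000 (next distinct salary)
```

The choice between the two is exactly the tie-handling question interviewers probe: ROW_NUMBER() treats a duplicate top salary as "second", DENSE_RANK() skips past it.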

150. Create a query to calculate the total number of transactions per user for each day. This typically involves GROUP BY and COUNT() for aggregation.

Detailed Interview Response / Example:

SELECT user_id, transaction_date, COUNT(*) AS total_transactions
FROM transactions
GROUP BY user_id, transaction_date;

151. Write a query to select projects with the highest budget-per-employee ratio from two related tables (projects and employees). This tests your ability to work with complex joins and aggregations.

Detailed Interview Response / Example:

SELECT
    p.project_id,
    p.project_name,
    (p.budget / COUNT(e.employee_id)) AS budget_per_employee
FROM projects p
JOIN employees e ON p.project_id = e.project_id
GROUP BY p.project_id, p.project_name, p.budget
ORDER BY budget_per_employee DESC
LIMIT 1;

152. Find the second-highest salary in a table without using LIMIT or TOP.

Explanation / Correction:

Finds the maximum salary less than the overall max, giving the second-highest.

Sample SQL Answer:

SELECT MAX(salary)
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);

153. Write a SQL query to find all employees who earn more than their managers.

Explanation / Correction:

Self-join employees to their managers and compare salaries.

Sample SQL Answer:

SELECT e1.*
FROM employees e1
JOIN employees e2 ON e1.manager_id = e2.id
WHERE e1.salary > e2.salary;
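A small runnable check of the self-join pattern, using SQLite via Python; the names and salaries are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INT, name TEXT, salary INT, manager_id INT);
    INSERT INTO employees VALUES
        (1, 'Asha',  120000, NULL),  -- top-level manager, no manager_id
        (2, 'Bilal', 130000, 1),     -- earns more than manager Asha
        (3, 'Chen',   90000, 1);     -- earns less than manager Asha
""")

# Self-join each employee (e1) to their manager (e2) and compare salaries.
result = conn.execute("""
    SELECT e1.name
    FROM employees e1
    JOIN employees e2 ON e1.manager_id = e2.id
    WHERE e1.salary > e2.salary
""").fetchall()
print(result)  # [('Bilal',)]
```

Note that the inner join silently drops employees with a NULL manager_id, which is usually the desired behavior here.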

154. Find the duplicate rows in a table without using GROUP BY.

Explanation / Correction:

Uses EXISTS to find duplicates based on column values, avoiding GROUP BY. Note that rowid is specific to databases such as Oracle and SQLite; on other systems, compare on a primary key instead.

Sample SQL Answer:

SELECT t1.*
FROM table_name t1
WHERE EXISTS (
    SELECT 1
    FROM table_name t2
    WHERE t1.column1 = t2.column1
      AND t1.column2 = t2.column2
      AND t1.rowid <> t2.rowid
);
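Since SQLite natively supports rowid, the EXISTS pattern can be verified directly; the table and rows below are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, item TEXT);
    INSERT INTO orders VALUES ('a', 'x'), ('a', 'x'), ('b', 'y');
""")

# Each row is a duplicate if another row (different rowid) has the same values.
dupes = conn.execute("""
    SELECT t1.customer, t1.item
    FROM orders t1
    WHERE EXISTS (
        SELECT 1 FROM orders t2
        WHERE t1.customer = t2.customer
          AND t1.item = t2.item
          AND t1.rowid <> t2.rowid
    )
""").fetchall()
print(dupes)  # both copies of ('a', 'x') are flagged
```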

155. Write a SQL query to find the top 10% of earners in a table.

Explanation / Correction:

NTILE(10) splits data into 10 buckets; decile 1 is top 10%.

Sample SQL Answer:

SELECT *
FROM (
    SELECT *, NTILE(10) OVER (ORDER BY salary DESC) AS decile
    FROM employees
) ranked
WHERE decile = 1;
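A runnable sketch of NTILE(10) on ten invented rows (SQLite 3.25+); with exactly ten rows each decile holds one employee, so decile 1 is the single top earner:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INT)")
# Invented data: e1 earns 10000, e2 earns 20000, ..., e10 earns 100000.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(f"e{i}", i * 10000) for i in range(1, 11)],
)

top_decile = conn.execute("""
    SELECT name FROM (
        SELECT name, NTILE(10) OVER (ORDER BY salary DESC) AS decile
        FROM employees
    ) ranked
    WHERE decile = 1
""").fetchall()
print(top_decile)  # [('e10',)] -- the highest earner
```

With more rows, bucket sizes grow proportionally, so decile 1 always holds roughly the top 10% of earners.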

156. Find the cumulative sum of a column in a table.

Explanation / Correction:

SUM() OVER (ORDER BY ...) computes running totals.

Sample SQL Answer:

SELECT *, SUM(column) OVER (ORDER BY id) AS cumulative_sum
FROM table_name;
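A runnable sketch of the running-total pattern the explanation describes, using SQLite via Python with an invented sales table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (id INT, amount INT);
    INSERT INTO sales VALUES (1, 100), (2, 200), (3, 50);
""")

# SUM() OVER (ORDER BY ...) accumulates rows in order: 100, 300, 350.
running = conn.execute("""
    SELECT id, amount,
           SUM(amount) OVER (ORDER BY id) AS cumulative_sum
    FROM sales
""").fetchall()
print(running)
```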

157. Write a SQL query to find all employees who have never taken a leave.

Explanation / Correction:

Finds employees whose IDs are not in the leave records. Prefer NOT EXISTS if leaves.employee_id can contain NULLs, since NOT IN returns no rows when the subquery yields a NULL.

Sample SQL Answer:

SELECT *
FROM employees
WHERE id NOT IN (SELECT employee_id FROM leaves);

158. Find the difference between the current row and the next row in a table.

Explanation / Correction:

LEAD() gets the next row's value for difference calculation.

Sample SQL Answer:

SELECT *, column - LEAD(column) OVER (ORDER BY id) AS diff_with_next
FROM table_name;

159. Write a SQL query to find all departments with more than one employee.

Explanation / Correction:

GROUP BY and HAVING find departments with multiple employees.

Sample SQL Answer:

SELECT department
FROM employees
GROUP BY department
HAVING COUNT(*) > 1;

160. Find the maximum value of a column for each group without using GROUP BY.

Explanation / Correction:

Window function MAX() OVER (PARTITION BY ...) finds group max per row.

Sample SQL Answer:

SELECT *
FROM (
    SELECT *, MAX(column) OVER (PARTITION BY group_column) AS max_in_group
    FROM table_name
) t
WHERE column = max_in_group;
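A quick runnable check of the MAX() OVER (PARTITION BY ...) pattern on an invented two-group table (SQLite 3.25+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (grp TEXT, val INT);
    INSERT INTO scores VALUES ('a', 1), ('a', 5), ('b', 3), ('b', 2);
""")

# The window MAX is attached to every row; keeping rows where val equals it
# selects each group's maximum without a GROUP BY.
group_max = conn.execute("""
    SELECT grp, val FROM (
        SELECT grp, val,
               MAX(val) OVER (PARTITION BY grp) AS max_in_group
        FROM scores
    ) t
    WHERE val = max_in_group
""").fetchall()
print(group_max)  # one max row per group
```

Unlike GROUP BY, this keeps all the other columns of the winning row, and returns every tied row if a group's maximum is shared.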

161. Write a SQL query to find all employees who have taken more than 3 leaves in a month.

Explanation / Correction:

Group by employee and month, then filter for those with >3 leaves.

Sample SQL Answer:

SELECT
    employee_id,
    EXTRACT(YEAR FROM leave_date) AS year,
    EXTRACT(MONTH FROM leave_date) AS month
FROM leaves
GROUP BY employee_id, year, month
HAVING COUNT(*) > 3;

SQL Scenario-Based Queries

Real-world SQL scenarios testing analytical and problem-solving abilities.

162. Scenario: Top N products by sales, Top N within each category, Top N employees by salaries

Explanation / Use Case:

Use ORDER BY with LIMIT for global top N; use ROW_NUMBER() OVER (PARTITION BY ...) for top N within groups.

Sample SQL Query:

-- Top N products overall
SELECT product_id, SUM(sales) AS total_sales
FROM sales
GROUP BY product_id
ORDER BY total_sales DESC
LIMIT N;

-- Top N products within each category
SELECT *
FROM (
    SELECT
        product_id,
        category_id,
        SUM(sales) AS total_sales,
        ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY SUM(sales) DESC) AS rn
    FROM sales
    GROUP BY product_id, category_id
) ranked
WHERE rn <= N;
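A runnable variant of the per-category top-N pattern, using SQLite via Python. It computes totals in a CTE first (a slight restructuring for clarity); the sales rows are invented and N is fixed at 2:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (product_id TEXT, category_id INT, sales INT);
    INSERT INTO sales VALUES
        ('p1', 1, 60), ('p1', 1, 40),   -- p1 totals 100 in category 1
        ('p2', 1, 90), ('p3', 1, 10),
        ('p4', 2, 50), ('p5', 2, 70);
""")

top2_per_category = conn.execute("""
    WITH totals AS (
        SELECT product_id, category_id, SUM(sales) AS total_sales
        FROM sales
        GROUP BY product_id, category_id
    )
    SELECT product_id, category_id, total_sales
    FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY category_id
                                     ORDER BY total_sales DESC) AS rn
        FROM totals
    ) ranked
    WHERE rn <= 2
""").fetchall()
print(top2_per_category)  # p1, p2 for category 1; p5, p4 for category 2
```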

163. Scenario: Year-over-year (YOY) growth, YOY by category, products with higher sales than previous month

Explanation / Use Case:

Use LAG() for YOY or month-over-month comparisons. Aggregate sales by time period and group.

Sample SQL Query:

-- YOY growth for each category
SELECT
    category_id,
    year,
    sales,
    sales - LAG(sales) OVER (PARTITION BY category_id ORDER BY year) AS yoy_growth
FROM (
    SELECT category_id, EXTRACT(YEAR FROM sale_date) AS year, SUM(sales) AS sales
    FROM sales
    GROUP BY category_id, year
) yearly;

-- Products with higher sales than previous month
SELECT product_id, month, sales
FROM (
    SELECT
        product_id,
        EXTRACT(MONTH FROM sale_date) AS month,
        SUM(sales) AS sales,
        LAG(SUM(sales)) OVER (PARTITION BY product_id ORDER BY month) AS prev_month_sales
    FROM sales
    GROUP BY product_id, month
) t
WHERE sales > prev_month_sales;
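The LAG() comparison can be verified on a tiny invented monthly series (SQLite 3.25+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monthly (product TEXT, month INT, sales INT);
    INSERT INTO monthly VALUES
        ('p1', 1, 100), ('p1', 2, 150), ('p1', 3, 120);
""")

# LAG(sales) pulls the previous month's value; month 2 (150 > 100) qualifies,
# month 3 (120 > 150) does not, and month 1 has no previous row (NULL).
grew = conn.execute("""
    SELECT product, month, sales FROM (
        SELECT product, month, sales,
               LAG(sales) OVER (PARTITION BY product ORDER BY month) AS prev_month_sales
        FROM monthly
    ) t
    WHERE sales > prev_month_sales
""").fetchall()
print(grew)  # [('p1', 2, 150)]
```

Note the first row of each partition is dropped automatically, since comparing against a NULL LAG() value yields unknown.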

164. Scenario: Running sales over months, rolling N months sales, within each category

Explanation / Use Case:

Use SUM() OVER with appropriate window frames for running or rolling totals.

Sample SQL Query:

-- Running total sales per product
SELECT
    product_id,
    month,
    SUM(sales) OVER (PARTITION BY product_id ORDER BY month) AS running_total
FROM sales;

-- Rolling 3-month sales per category
SELECT
    category_id,
    month,
    SUM(sales) OVER (
        PARTITION BY category_id
        ORDER BY month
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS rolling_3_month_sales
FROM sales;
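A runnable sketch of the ROWS BETWEEN frame on four invented months (SQLite 3.25+), showing how the rolling window caps at three rows once enough history exists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monthly (month INT, sales INT);
    INSERT INTO monthly VALUES (1, 10), (2, 20), (3, 30), (4, 40);
""")

# ROWS BETWEEN 2 PRECEDING AND CURRENT ROW sums at most 3 rows:
# month 1 -> 10, month 2 -> 30, month 3 -> 60, month 4 -> 90 (drops month 1).
rolling = conn.execute("""
    SELECT month,
           SUM(sales) OVER (
               ORDER BY month
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS rolling_3_month_sales
    FROM monthly
""").fetchall()
print(rolling)
```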

165. Scenario: Pivot rows to columns (e.g., year-wise sales for each category in separate columns)

Explanation / Use Case:

Use CASE WHEN inside aggregate functions to pivot data from rows to columns.

Sample SQL Query:

SELECT
    category_id,
    SUM(CASE WHEN year = 2021 THEN sales ELSE 0 END) AS sales_2021,
    SUM(CASE WHEN year = 2022 THEN sales ELSE 0 END) AS sales_2022
FROM (
    SELECT category_id, EXTRACT(YEAR FROM sale_date) AS year, sales
    FROM sales
) t
GROUP BY category_id;
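The CASE-WHEN pivot can be checked end to end on invented rows; note how a category with no 2022 sales gets 0, not NULL, thanks to the ELSE 0 branch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cat_sales (category TEXT, year INT, sales INT);
    INSERT INTO cat_sales VALUES
        ('a', 2021, 5), ('a', 2022, 7), ('b', 2021, 3);
""")

# Each CASE WHEN keeps only one year's sales, so SUM() pivots years into columns.
pivoted = conn.execute("""
    SELECT category,
           SUM(CASE WHEN year = 2021 THEN sales ELSE 0 END) AS sales_2021,
           SUM(CASE WHEN year = 2022 THEN sales ELSE 0 END) AS sales_2022
    FROM cat_sales
    GROUP BY category
    ORDER BY category
""").fetchall()
print(pivoted)  # [('a', 5, 7), ('b', 3, 0)]
```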

166. Scenario: Number of records after different kinds of joins

Explanation / Use Case:

Different joins affect row counts based on matching and non-matching records. Useful for data quality checks and understanding data relationships.

Sample SQL Query:

-- INNER JOIN
SELECT COUNT(*) FROM a INNER JOIN b ON a.id = b.id;

-- LEFT JOIN
SELECT COUNT(*) FROM a LEFT JOIN b ON a.id = b.id;

-- RIGHT JOIN
SELECT COUNT(*) FROM a RIGHT JOIN b ON a.id = b.id;

-- FULL OUTER JOIN
SELECT COUNT(*) FROM a FULL OUTER JOIN b ON a.id = b.id;

167. Scenario: Consider a table named "Orders" with the following columns: OrderID, OrderDate, CustomerID. Write a SQL query to find the customers who have placed orders on consecutive days.

Explanation / Use Case:

Joins the Orders table to itself to find customers with orders on consecutive days. Note that DATEDIFF(DAY, ...) is SQL Server syntax; MySQL uses DATEDIFF(later_date, earlier_date), and other databases subtract dates directly.

Sample SQL Query:

SELECT DISTINCT o1.CustomerID
FROM Orders o1
JOIN Orders o2
  ON o1.CustomerID = o2.CustomerID
 AND DATEDIFF(DAY, o1.OrderDate, o2.OrderDate) = 1;
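Because DATEDIFF varies by dialect, here is the same self-join verified in SQLite, where julianday() differences play the same role; the order rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Orders (OrderID INT, CustomerID INT, OrderDate TEXT);
    INSERT INTO Orders VALUES
        (1, 1, '2024-01-10'), (2, 1, '2024-01-11'),  -- consecutive days
        (3, 2, '2024-01-10'), (4, 2, '2024-01-15');  -- not consecutive
""")

# julianday() returns a day number, so a difference of 1 means consecutive days.
consecutive = conn.execute("""
    SELECT DISTINCT o1.CustomerID
    FROM Orders o1
    JOIN Orders o2
      ON o1.CustomerID = o2.CustomerID
     AND julianday(o2.OrderDate) - julianday(o1.OrderDate) = 1
""").fetchall()
print(consecutive)  # [(1,)] -- only customer 1 ordered on consecutive days
```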

SQL Best Practices

Essential SQL optimization techniques and coding standards for efficient query writing.

Use EXISTS in place of IN wherever possible

EXISTS is often more efficient, especially with correlated subqueries, as it stops searching once a match is found.

Use table aliases with columns when joining multiple tables

Improves readability and avoids ambiguity, especially in complex queries.

Use GROUP BY instead of DISTINCT

GROUP BY is more flexible and can be optimized better by query optimizers.

Add useful comments for complex logic, avoid too many comments

Comments help others (and your future self) understand non-trivial logic, but excessive comments can clutter code.

Use joins instead of subqueries where they give better performance

Joins are typically more efficient and easier for query optimizers to handle than nested subqueries.

Use WHERE instead of HAVING to filter non-aggregate fields

WHERE filters rows before aggregation, improving performance. HAVING should be used only for aggregate conditions.

Avoid wildcards at the beginning of predicates (e.g., '%abc')

Leading wildcards prevent index usage, causing full table scans.

Consider cardinality within GROUP BY (unique columns first)

Grouping by columns with higher cardinality first can improve query performance by reducing intermediate result size.

Write SQL keywords in capital letters for readability

Capitalizing SQL keywords (SELECT, FROM, WHERE) improves readability and makes statements easier to scan visually.

Never use SELECT *; always specify columns

Improves performance, reduces unnecessary data transfer, and makes breaking changes to schemas easier to detect.

Create CTEs instead of multiple subqueries

CTEs (WITH clauses) make queries easier to read, maintain, and debug.

Join tables using JOIN keywords, not WHERE clause

Increases readability and reduces risk of accidental cross joins or logic errors.

Never use ORDER BY in subqueries

ORDER BY in a subquery is typically ignored by the database, so it only increases runtime without benefit.

Use UNION ALL instead of UNION when no duplicates

UNION ALL is faster because it skips duplicate elimination. Use UNION only when necessary.

Start WHERE clause with 1 = 1 for easy debugging

Allows easy commenting/uncommenting of conditions during query development.

Handle NULLs before using equality or comparison operators

Prevents unexpected results or errors, as NULL comparisons behave differently.
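A quick demonstration of why this matters, run through SQLite: NULL compared with = yields unknown (not true), while IS NULL and COALESCE handle it explicitly:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# NULL = NULL evaluates to NULL (unknown), not 1 (true).
eq_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

# IS NULL is the correct test and returns true.
is_null = conn.execute("SELECT NULL IS NULL").fetchone()[0]

# COALESCE substitutes a default before any comparison.
coalesced = conn.execute("SELECT COALESCE(NULL, 0)").fetchone()[0]

print(eq_null, is_null, coalesced)  # None 1 0
```

This is also why a WHERE clause like `col <> 'x'` silently excludes rows where col is NULL.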

Filter queries before joining or CTE to reduce data volume early

Improves performance and reduces resource usage.

Ensure JOIN conditions use keys or indexed attributes

Indexed joins are faster and more efficient, reducing query execution time.