PySpark coding interview question: find students with the same marks in math and chemistry
In this video, we tackle a PySpark coding interview question commonly asked in Data Engineering interviews. We will:
✅ Generate a dataset of students with marks in multiple subjects.
✅ Implement an optimized PySpark solution to find students having the same marks in Math and Chemistry.
✅ Use the LAG window function with filtering, avoiding a potentially expensive `pivot()`.
✅ Discuss optimization techniques to improve PySpark query performance.
This question is often asked in interviews at companies like PwC, KPIT, Accenture, and more!
📌 Code Used in the Video:
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, lag

# Reuse the notebook's session (e.g. on Databricks) or start one if none exists
spark = SparkSession.builder.getOrCreate()

# Create DataFrame: one row per (student, subject)
data = [
    (101, "Alice", "Math", 85),
    (101, "Alice", "Chemistry", 85),
    (101, "Alice", "Physics", 78),
    (102, "Bob", "Math", 90),
    (102, "Bob", "Physics", 88),
    (103, "Charlie", "Math", 75),
    (103, "Charlie", "Chemistry", 75),
    (104, "David", "Math", 88),
    (104, "David", "Chemistry", 88),
    (104, "David", "Physics", 95),
    (105, "Eve", "Chemistry", 91),
]
columns = ["Student_ID", "Name", "Subject", "Marks"]
df = spark.createDataFrame(data, columns)

# show() works everywhere; display() is only available in Databricks notebooks
df.show()
📌 **Subscribe for More PySpark Interview Questions!**
#pyspark #databricks #dataengineering #interviewquestions #bigdata #spark #azuredataengineer #hadooptutorial #interviewquestionsandanswers #accentureinterview #kpittech