
A data scientist at an e-commerce company is working with user data from its subscriber database, stored in a DataFrame df_user. Before processing the data further, the data scientist wants to create another DataFrame, df_user_non_pii, containing only the non-PII columns. The PII columns in df_user are first_name, last_name, email, and birthdate.

Which code snippet can be used to meet this requirement?

A.

df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")

B.

df_user_non_pii = df_user.drop("first_name, last_name, email, birthdate")

C.

df_user_non_pii = df_user.dropfields("first_name", "last_name", "email", "birthdate")

D.

df_user_non_pii = df_user.dropfields("first_name, last_name, email, birthdate")
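
For reference, a self-contained sketch of the drop-based approach (the sample row and the extra user_id/tier columns are illustrative, not from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative data; the PII column names match the question's schema.
df_user = spark.createDataFrame(
    [(1, "Ada", "Lovelace", "ada@example.com", "1815-12-10", "gold")],
    ["user_id", "first_name", "last_name", "email", "birthdate", "tier"],
)

# DataFrame.drop accepts one or more column names as separate arguments
# and silently ignores names that do not exist.
df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")
df_user_non_pii.show()  # only user_id and tier remain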

A data engineer is reviewing a Spark application that applies several transformations to a DataFrame but notices that the job does not start executing immediately.

Which two characteristics of Apache Spark's execution model explain this behavior?

Choose 2 answers:

A.

The Spark engine requires manual intervention to start executing transformations.

B.

Only actions trigger the execution of the transformation pipeline.

C.

Transformations are executed immediately to build the lineage graph.

D.

The Spark engine optimizes the execution plan during the transformations, causing delays.

E.

Transformations are evaluated lazily.
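
A small sketch of the lazy-evaluation behavior described above:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

# Transformations only extend the logical plan; no job starts here.
transformed = df.filter(F.col("id") % 2 == 0).withColumn("doubled", F.col("id") * 2)

# Only an action (count, collect, show, write, ...) triggers execution
# of the whole optimized pipeline.
print(transformed.count())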

A data engineer uses a broadcast variable to share a DataFrame containing millions of rows across executors for lookup purposes. What will be the outcome?

A.

The job may fail if the memory on each executor is not large enough to accommodate the DataFrame being broadcasted

B.

The job may fail if the executors do not have enough CPU cores to process the broadcasted dataset

C.

The job will hang indefinitely as Spark will struggle to distribute and serialize such a large broadcast variable to all executors

D.

The job may fail because the driver does not have enough CPU cores to serialize the large DataFrame
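
For illustration, a minimal sketch of the broadcast-variable lookup pattern (lookup_df, sales_df, and the column names are hypothetical). A broadcast variable is fully replicated into each executor's memory, which is why insufficient executor memory is the realistic failure mode for a multi-million-row lookup:

# The lookup data must first be collected to the driver, then a full
# copy is shipped to every executor.
lookup_rows = lookup_df.collect()
lookup_map = {row["product_id"]: row["product_name"] for row in lookup_rows}
bc_lookup = spark.sparkContext.broadcast(lookup_map)

# Executors read bc_lookup.value locally, with no shuffle.
result = sales_df.rdd.map(lambda r: (r["product_id"], bc_lookup.value.get(r["product_id"])))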


A data engineer is working on a real-time analytics pipeline using Spark Structured Streaming.

They want the system to process incoming data in micro-batches at a fixed interval of 5 seconds.

Which code snippet fulfills this requirement?

A.

query = df.writeStream \
    .outputMode("append") \
    .trigger(processingTime="5 seconds") \
    .start()

B.

query = df.writeStream \
    .outputMode("append") \
    .trigger(continuous="5 seconds") \
    .start()

C.

query = df.writeStream \
    .outputMode("append") \
    .trigger(once=True) \
    .start()

D.

query = df.writeStream \
    .outputMode("append") \
    .start()
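
For context, a minimal end-to-end sketch of a fixed 5-second micro-batch trigger, using the built-in rate source as a stand-in for real streaming input:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The "rate" source generates synthetic rows for testing.
df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (
    df.writeStream
    .format("console")
    .outputMode("append")
    .trigger(processingTime="5 seconds")  # micro-batch every 5 seconds
    .start()
)
query.awaitTermination()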


A data engineer is working with the DataFrame df1 and wants the Name with the highest count to appear first (descending order by count), followed by the next highest, and so on.

The DataFrame has columns:

id | Name    | count | timestamp
---|---------|-------|----------
1  | USA     | 10    |
2  | India   | 20    |
3  | England | 50    |
4  | India   | 50    |
5  | France  | 20    |
6  | India   | 10    |
7  | USA     | 30    |
8  | USA     | 40    |

Which code fragment should the engineer use to sort the data in the Name and count columns?

A.

df1.orderBy(col("count").desc(), col("Name").asc())

B.

df1.sort("Name", "count")

C.

df1.orderBy("Name", "count")

D.

df1.orderBy(col("Name").desc(), col("count").asc())

In the code block below, aggDF contains aggregations on a streaming DataFrame:
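
The snippet is reconstructed here as a minimal sketch (the console sink and exact line layout are assumptions inferred from the question):

query = (aggDF.writeStream     # line 1
    .format("console")         # line 2
    .outputMode("...")         # line 3: the output mode in question
    .start())                  # line 4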

Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?

A.

complete

B.

append

C.

replace

D.

aggregate

A data engineer writes the following code to join two DataFrames df1 and df2:

df1 = spark.read.csv("sales_data.csv")    # ~10 GB
df2 = spark.read.csv("product_data.csv")  # ~8 MB
result = df1.join(df2, df1.product_id == df2.product_id)

Which join strategy will Spark use?

A.

Shuffle join, because AQE is not enabled, and Spark uses a static query plan

B.

Broadcast join, as df2 is smaller than the default broadcast threshold

C.

Shuffle join, as the size difference between df1 and df2 is too large for a broadcast join to work efficiently

D.

Shuffle join because no broadcast hints were provided
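
A quick way to inspect the chosen strategy: the default spark.sql.autoBroadcastJoinThreshold is 10 MB (10485760 bytes), so an ~8 MB table qualifies for an automatic broadcast even without a hint. Assuming the question's DataFrames with a resolvable product_id column:

# 10485760 by default
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

result = df1.join(df2, df1.product_id == df2.product_id)
result.explain()  # look for BroadcastHashJoin in the physical plan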


A developer wants to refactor older Spark code to take advantage of built-in functions introduced in Spark 3.5.

The original code:

from pyspark.sql import functions as F

min_price = 110.50
result_df = prices_df.filter(F.col("price") > min_price).agg(F.count("*"))

Which code block should the developer use to refactor the code?

A.

result_df = prices_df.filter(F.col("price") > F.lit(min_price)).agg(F.count("*"))

B.

result_df = prices_df.where(F.lit("price") > min_price).groupBy().count()

C.

result_df = prices_df.withColumn("valid_price", when(col("price") > F.lit(min_price), True))

D.

result_df = prices_df.filter(F.lit(min_price) > F.col("price")).count()
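
For reference, a self-contained sketch of the F.lit-based comparison (the sample prices are illustrative):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
prices_df = spark.createDataFrame([(100.0,), (120.0,), (150.0,)], ["price"])

min_price = 110.50
# F.lit wraps the Python float in a Column literal, so the whole
# predicate stays within Spark's built-in expression functions.
result_df = prices_df.filter(F.col("price") > F.lit(min_price)).agg(F.count("*"))
result_df.show()  # count = 2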


Which feature of Spark Connect should be considered when designing an application that requires remote interaction with a Spark cluster?

A.

It is primarily used for data ingestion into Spark from external sources.

B.

It provides a way to run Spark applications remotely in any programming language.

C.

It can be used to interact with any remote cluster using the REST API.

D.

It allows for remote execution of Spark jobs.
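
For context, a minimal Spark Connect client sketch (the endpoint host and port are placeholders); DataFrame operations issued by the client execute remotely on the cluster:

from pyspark.sql import SparkSession

# "sc://" endpoints use the Spark Connect protocol (PySpark 3.4+).
spark = SparkSession.builder.remote("sc://spark-server:15002").getOrCreate()

spark.range(5).show()  # the job runs on the remote cluster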

A Spark DataFrame df is cached using the MEMORY_AND_DISK storage level, but the DataFrame is too large to fit entirely in memory.

What is the likely behavior when Spark runs out of memory to store the DataFrame?

A.

Spark duplicates the DataFrame in both memory and disk. If it doesn't fit in memory, the DataFrame is stored and retrieved from the disk entirely.

B.

Spark splits the DataFrame evenly between memory and disk, ensuring balanced storage utilization.

C.

Spark will store as much data as possible in memory and spill the rest to disk when memory is full, continuing processing with performance overhead.

D.

Spark stores the frequently accessed rows in memory and less frequently accessed rows on disk, utilizing both resources to offer balanced performance.
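
A minimal sketch of the MEMORY_AND_DISK behavior described above (the dataset size is illustrative):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100_000_000)

# Partitions that fit stay in memory; the rest spill to disk and are
# re-read from disk on access, so processing continues with extra I/O cost.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  # materializes the cache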