
Which of the following code blocks returns a DataFrame with an added column to DataFrame transactionsDf that shows the unix epoch timestamps in column transactionDate as strings in the format month/day/year in column transactionDateFormatted?

Excerpt of DataFrame transactionsDf:

A.

transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="dd/MM/yyyy"))

B.

transactionsDf.withColumnRenamed("transactionDate", "transactionDateFormatted", from_unixtime("transactionDateFormatted", format="MM/dd/yyyy"))

C.

transactionsDf.apply(from_unixtime(format="MM/dd/yyyy")).asColumn("transactionDateFormatted")

D.

transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="MM/dd/yyyy"))

E.

transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate"))

The code block displayed below contains an error. The code block should configure Spark so that DataFrames up to a size of 20 MB will be broadcast to all worker nodes when performing a join.

Find the error.

Code block:

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 20)

A.

Spark will only broadcast DataFrames that are much smaller than the default value.

B.

The correct option to write configurations is through spark.config and not spark.conf.

C.

Spark will only apply the limit to threshold joins and not to other joins.

D.

The passed limit has the wrong variable type.

E.

The command is evaluated lazily and needs to be followed by an action.
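The threshold is interpreted in bytes, not megabytes, so passing 20 sets a roughly 20-byte limit, far below the default. A sketch of the intended configuration (assuming an active SparkSession named spark):

```python
# spark.sql.autoBroadcastJoinThreshold is read in bytes,
# so 20 MB must be spelled out explicitly
threshold_bytes = 20 * 1024 * 1024  # 20 MB

# assuming an active SparkSession named `spark`:
# spark.conf.set("spark.sql.autoBroadcastJoinThreshold", threshold_bytes)
```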

The code block displayed below contains an error. The code block should count the number of rows that have a predError of either 3 or 6. Find the error.

Code block:

transactionsDf.filter(col('predError').in([3, 6])).count()

A.

The number of rows cannot be determined with the count() operator.

B.

Instead of filter, the select method should be used.

C.

The method used on column predError is incorrect.

D.

Instead of a list, the values need to be passed as single arguments to the in operator.

E.

Numbers 3 and 6 need to be passed as string variables.

Which of the elements in the labeled panels represent the operation performed for broadcast variables?


A.

2, 5

B.

3

C.

2, 3

D.

1, 2

E.

1, 3, 4

The code block displayed below contains an error. The code block is intended to join DataFrame itemsDf with the larger DataFrame transactionsDf on column itemId. Find the error.

Code block:

transactionsDf.join(itemsDf, "itemId", how="broadcast")

A.

The syntax is wrong, how= should be removed from the code block.

B.

The join method should be replaced by the broadcast method.

C.

Spark will only perform the broadcast operation if this behavior has been enabled on the Spark cluster.

D.

The larger DataFrame transactionsDf is being broadcasted, rather than the smaller DataFrame itemsDf.

E.

broadcast is not a valid join type.

Which of the following code blocks returns a DataFrame that has all columns of DataFrame transactionsDf and an additional column predErrorSquared which is the squared value of column predError in DataFrame transactionsDf?

A.

transactionsDf.withColumn("predError", pow(col("predErrorSquared"), 2))

B.

transactionsDf.withColumnRenamed("predErrorSquared", pow(predError, 2))

C.

transactionsDf.withColumn("predErrorSquared", pow(col("predError"), lit(2)))

D.

transactionsDf.withColumn("predErrorSquared", pow(predError, lit(2)))

E.

transactionsDf.withColumn("predErrorSquared", "predError"**2)

Which of the following code blocks returns a new DataFrame with only columns predError and value of every second row of DataFrame transactionsDf?

Entire DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+

A.

transactionsDf.filter(col("transactionId").isin([3,4,6])).select([predError, value])

B.

transactionsDf.select(col("transactionId").isin([3,4,6]), "predError", "value")

C.

transactionsDf.filter("transactionId" % 2 == 0).select("predError", "value")

D.

transactionsDf.filter(col("transactionId") % 2 == 0).select("predError", "value")

(Correct)

E.

transactionsDf.createOrReplaceTempView("transactionsDf")

spark.sql("FROM transactionsDf SELECT predError, value WHERE transactionId % 2 = 2")

F.

transactionsDf.filter(col(transactionId).isin([3,4,6]))

Which of the following code blocks returns a single-row DataFrame that only has a column corr which shows the Pearson correlation coefficient between columns predError and value in DataFrame transactionsDf?

A.

transactionsDf.select(corr(["predError", "value"]).alias("corr")).first()

B.

transactionsDf.select(corr(col("predError"), col("value")).alias("corr")).first()

C.

transactionsDf.select(corr(predError, value).alias("corr"))

D.

transactionsDf.select(corr(col("predError"), col("value")).alias("corr"))

(Correct)

E.

transactionsDf.select(corr("predError", "value"))

Which of the following statements about storage levels is incorrect?

A.

The cache operator on DataFrames is evaluated like a transformation.

B.

In client mode, DataFrames cached with the MEMORY_ONLY_2 level will not be stored in the edge node's memory.

C.

Caching can be undone using the DataFrame.unpersist() operator.

D.

MEMORY_AND_DISK replicates cached DataFrames both on memory and disk.

E.

DISK_ONLY will not use the worker node's memory.

The code block displayed below contains an error. The code block should produce a DataFrame with color as the only column and three rows with color values of red, blue, and green, respectively.

Find the error.

Code block:

spark.createDataFrame([("red",), ("blue",), ("green",)], "color")

A.

Instead of calling spark.createDataFrame, just DataFrame should be called.

B.

The commas in the tuples with the colors should be eliminated.

C.

The colors red, blue, and green should be expressed as a simple Python list, and not a list of tuples.

D.

Instead of color, a data type should be specified.

E.

The "color" expression needs to be wrapped in brackets, so it reads ["color"].