
The code block shown below should return the number of columns in the CSV file stored at location filePath. Only lines that do not start with a # character should be read from the CSV file. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

__1__(__2__.__3__.csv(filePath, __4__).__5__)

A.

1. size

2. spark

3. read()

4. escape='#'

5. columns

B.

1. DataFrame

2. spark

3. read()

4. escape='#'

5. shape[0]

C.

1. len

2. pyspark

3. DataFrameReader

4. comment='#'

5. columns

D.

1. size

2. pyspark

3. DataFrameReader

4. comment='#'

5. columns

E.

1. len

2. spark

3. read

4. comment='#'

5. columns
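
For reference, a minimal PySpark sketch of counting the columns of a CSV file while skipping comment lines; the file path is just a placeholder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# comment='#' tells the CSV reader to ignore lines starting with '#'
df = spark.read.csv("/FileStore/data.csv", comment="#")
# df.columns is a Python list of column names, so len() gives the column count
print(len(df.columns))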

Which of the following code blocks reduces a DataFrame from 12 to 6 partitions and performs a full shuffle?

A.

DataFrame.repartition(12)

B.

DataFrame.coalesce(6).shuffle()

C.

DataFrame.coalesce(6)

D.

DataFrame.coalesce(6, shuffle=True)

E.

DataFrame.repartition(6)
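
As background, a short sketch of the difference between repartition() and coalesce() on an arbitrary DataFrame df:

# repartition(n) always performs a full shuffle and yields exactly n partitions;
# it can both increase and decrease the partition count
df_shuffled = df.repartition(6)

# coalesce(n) only merges existing partitions and avoids a full shuffle;
# it can reduce the partition count but never increase it
df_merged = df.coalesce(6)

print(df_shuffled.rdd.getNumPartitions(), df_merged.rdd.getNumPartitions())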

The code block displayed below contains an error. The code block should return a DataFrame where all entries in column supplier contain the letter combination et in this order. Find the error.

Code block:

itemsDf.filter(Column('supplier').isin('et'))

A.

The Column operator should be replaced by the col operator and instead of isin, contains should be used.

B.

The expression inside the filter parenthesis is malformed and should be replaced by isin('et', 'supplier').

C.

Instead of isin, it should be checked whether column supplier contains the letters et, so isin should be replaced with contains. In addition, the column should be accessed using col['supplier'].

D.

The expression only returns a single column and filter should be replaced by select.
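
For context, a minimal sketch of a substring filter on a string column using col() and contains(), with itemsDf and its supplier column as given in the question:

from pyspark.sql.functions import col

# keep only rows whose supplier value contains the substring 'et'
itemsDf.filter(col("supplier").contains("et")).show()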

Which of the following describes Spark's Adaptive Query Execution?

A.

Adaptive Query Execution features include dynamically coalescing shuffle partitions, dynamically injecting scan filters, and dynamically optimizing skew joins.

B.

Adaptive Query Execution is enabled in Spark by default.

C.

Adaptive Query Execution reoptimizes queries at execution points.

D.

Adaptive Query Execution features are dynamically switching join strategies and dynamically optimizing skew joins.

E.

Adaptive Query Execution applies to all kinds of queries.
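
For reference, the main Adaptive Query Execution switches are exposed through the Spark SQL configuration and can be toggled at runtime (spark denotes an existing SparkSession):

# master switch for Adaptive Query Execution
spark.conf.set("spark.sql.adaptive.enabled", "true")
# AQE sub-features: coalescing shuffle partitions and skew-join optimization
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")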

The code block shown below should return all rows of DataFrame itemsDf that have at least 3 items in column itemNameElements. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Example of DataFrame itemsDf:

+------+----------------------------------+-------------------+------------------------------------------+
|itemId|itemName                          |supplier           |itemNameElements                          |
+------+----------------------------------+-------------------+------------------------------------------+
|1     |Thick Coat for Walking in the Snow|Sports Company Inc.|[Thick, Coat, for, Walking, in, the, Snow]|
|2     |Elegant Outdoors Summer Dress     |YetiX              |[Elegant, Outdoors, Summer, Dress]        |
|3     |Outdoors Backpack                 |Sports Company Inc.|[Outdoors, Backpack]                      |
+------+----------------------------------+-------------------+------------------------------------------+

Code block:

itemsDf.__1__(__2__(__3__)__4__)

A.

1. select

2. count

3. col("itemNameElements")

4. >3

B.

1. filter

2. count

3. itemNameElements

4. >=3

C.

1. select

2. count

3. "itemNameElements"

4. >3

D.

1. filter

2. size

3. "itemNameElements"

4. >=3

(Correct)

E.

1. select

2. size

3. "itemNameElements"

4. >3
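
For reference, a minimal sketch of filtering on the length of an array column with size(), using the itemsDf example shown above:

from pyspark.sql.functions import size

# size() returns the number of elements in the array column itemNameElements
itemsDf.filter(size("itemNameElements") >= 3).show()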

The code block shown below should return a DataFrame with two columns, itemId and col. In this DataFrame, for each element in column attributes of DataFrame itemsDf there should be a separate row in which column itemId contains the associated itemId from DataFrame itemsDf. The new DataFrame should only contain rows for those rows of DataFrame itemsDf in which column attributes contains the element cozy.

A sample of DataFrame itemsDf is below.

Code block:

itemsDf.__1__(__2__).__3__(__4__, __5__(__6__))

A.

1. filter

2. array_contains("cozy")

3. select

4. "itemId"

5. explode

6. "attributes"

B.

1. where

2. "array_contains(attributes, 'cozy')"

3. select

4. itemId

5. explode

6. attributes

C.

1. filter

2. "array_contains(attributes, 'cozy')"

3. select

4. "itemId"

5. map

6. "attributes"

D.

1. filter

2. "array_contains(attributes, cozy)"

3. select

4. "itemId"

5. explode

6. "attributes"

E.

1. filter

2. "array_contains(attributes, 'cozy')"

3. select

4. "itemId"

5. explode

6. "attributes"

Which of the following code blocks reads JSON file imports.json into a DataFrame?

A.

spark.read().mode("json").path("/FileStore/imports.json")

B.

spark.read.format("json").path("/FileStore/imports.json")

C.

spark.read("json", "/FileStore/imports.json")

D.

spark.read.json("/FileStore/imports.json")

E.

spark.read().json("/FileStore/imports.json")
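
For reference, a minimal sketch of reading a JSON file into a DataFrame, using the path from the question and assuming an existing SparkSession named spark:

# shortcut reader method
df = spark.read.json("/FileStore/imports.json")
# equivalent long form via format()/load()
df = spark.read.format("json").load("/FileStore/imports.json")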

Which of the following describes characteristics of the Dataset API?

A.

The Dataset API does not support unstructured data.

B.

In Python, the Dataset API mainly resembles Pandas' DataFrame API.

C.

In Python, the Dataset API's schema is constructed via type hints.

D.

The Dataset API is available in Scala, but it is not available in Python.

E.

The Dataset API does not provide compile-time type safety.

Which of the following code blocks returns a DataFrame with approximately 1,000 rows from the 10,000-row DataFrame itemsDf, without any duplicates, returning the same rows even if the code block is run twice?

A.

itemsDf.sampleBy("row", fractions={0: 0.1}, seed=82371)

B.

itemsDf.sample(fraction=0.1, seed=87238)

C.

itemsDf.sample(fraction=1000, seed=98263)

D.

itemsDf.sample(withReplacement=True, fraction=0.1, seed=23536)

E.

itemsDf.sample(fraction=0.1)
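
For context, a short sketch of DataFrame.sample(): without withReplacement=True each row is sampled at most once, and fixing the seed makes the result reproducible as long as the input partitioning does not change (the fraction and seed below are arbitrary):

# roughly 10% of the rows, no duplicates, deterministic for a given seed
sampled = itemsDf.sample(fraction=0.1, seed=87238)
print(sampled.count())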

The code block displayed below contains an error. The code block should arrange the rows of DataFrame transactionsDf using information from two columns in an ordered fashion: first by column value, showing smaller numbers at the top and greater numbers at the bottom, and then by column predError, for which all values should be arranged in the inverse of the order used for column value. Find the error.

Code block:

transactionsDf.orderBy('value', asc_nulls_first(col('predError')))

A.

Two orderBy statements with calls to the individual columns should be chained, instead of having both columns in one orderBy statement.

B.

Column value should be wrapped by the col() operator.

C.

Column predError should be sorted in a descending way, putting nulls last.

D.

Column predError should be sorted by desc_nulls_first() instead.

E.

Instead of orderBy, sort should be used.
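
For reference, a minimal sketch of a two-key sort in PySpark, ascending on one column and descending on the other; null placement can be controlled with the *_nulls_first / *_nulls_last variants:

from pyspark.sql.functions import col, desc_nulls_first

# value ascending (the default), predError descending with nulls first
transactionsDf.orderBy("value", desc_nulls_first(col("predError"))).show()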