Given the following error traceback (from display(df.select(3 * "heartrate"))) which shows AnalysisException: cannot resolve 'heartrateheartrateheartrate', which statement describes the error being raised?

A.

There is a type error because a DataFrame object cannot be multiplied.

B.

There is a syntax error because the heartrate column is not correctly identified as a column.

C.

There is no column in the table named heartrateheartrateheartrate.

D.

There is a type error because a column object cannot be multiplied.
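
In Python, multiplying a string by an integer repeats the string, so the expression 3 * "heartrate" evaluates to "heartrateheartrateheartrate" before Spark ever parses the query. A minimal sketch of the failing expression and the column-object alternative (assuming a DataFrame df with a heartrate column, as in the question):

from pyspark.sql import functions as F

# 3 * "heartrate" is plain Python string repetition, so Spark is asked to
# resolve a column literally named "heartrateheartrateheartrate"
# display(df.select(3 * "heartrate"))  # raises AnalysisException

# Multiplying the column object scales the column's values instead
display(df.select(3 * F.col("heartrate")))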

A data engineer has created a transactions Delta table on Databricks that should be used by the analytics team. The analytics team wants to use the table with another tool that requires Apache Iceberg format.

What should the data engineer do?

A.

Require the analytics team to use a tool that supports Delta table.

B.

Enable UniForm on the transactions table with 'iceberg' so that the table can be read as an Iceberg table.

C.

Create an Iceberg copy of the transactions Delta table which can be used by the analytics team.

D.

Convert the transactions Delta table to Iceberg and enable UniForm so that the table can be read as a Delta table.
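
For context, UniForm is enabled through Delta table properties rather than by copying data. A hedged sketch of enabling Iceberg reads on the existing table (property names as documented for recent Databricks runtimes; verify against your runtime version):

spark.sql("""
    ALTER TABLE transactions SET TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")

After this, Delta remains the source of truth while Iceberg metadata is generated alongside the table, so Iceberg-based tools can read it in place.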

A junior developer complains that the code in their notebook isn't producing the correct results in the development environment. A shared screenshot reveals that while they're using a notebook versioned with Databricks Repos, they're using a personal branch that contains old logic. The desired branch named dev-2.3.9 is not available from the branch selection dropdown.

Which approach will allow this developer to review the current logic for this notebook?

A.

Use Repos to make a pull request, then use the Databricks REST API to update the current branch to dev-2.3.9

B.

Use Repos to pull changes from the remote Git repository and select the dev-2.3.9 branch.

C.

Use Repos to checkout the dev-2.3.9 branch and auto-resolve conflicts with the current branch

D.

Merge all changes back to the main branch in the remote Git repository and clone the repo again

E.

Use Repos to merge the current branch and the dev-2.3.9 branch, then make a pull request to sync with the remote repository

A Databricks SQL dashboard has been configured to monitor the total number of records present in a collection of Delta Lake tables using the following query pattern:

SELECT COUNT(*) FROM table -

Which of the following describes how results are generated each time the dashboard is updated?

A.

The total count of rows is calculated by scanning all data files

B.

The total count of rows will be returned from cached results unless REFRESH is run

C.

The total count of records is calculated from the Delta transaction logs

D.

The total count of records is calculated from the parquet file metadata

E.

The total count of records is calculated from the Hive metastore
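
Delta Lake records per-file statistics, including numRecords, in its transaction log, which is what allows a bare COUNT(*) to be answered without scanning the data files. A rough, illustrative sketch of where those statistics live (hypothetical table path; an exact count would also have to replay remove actions and parquet checkpoints, so this is for illustration only):

from pyspark.sql import functions as F

log = spark.read.json("/path/to/table/_delta_log/*.json")  # hypothetical path
(log.filter("add IS NOT NULL")
    .select(F.get_json_object("add.stats", "$.numRecords").cast("long").alias("num_records"))
    .agg(F.sum("num_records").alias("rows_added"))
    .show())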

How are the operational aspects of Lakeflow Declarative Pipelines different from Spark Structured Streaming?

A.

Lakeflow Declarative Pipelines manage the orchestration of multi-stage pipelines automatically, while Structured Streaming requires external orchestration for complex dependencies.

B.

Structured Streaming can process continuous data streams, while Lakeflow Declarative Pipelines cannot.

C.

Lakeflow Declarative Pipelines can write to Delta Lake format, while Structured Streaming cannot.

D.

Lakeflow Declarative Pipelines automatically handle schema evolution, while Structured Streaming always requires manual schema management.
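
The operational contrast is easiest to see in a minimal declarative pipeline: each table is declared as a function, and the pipeline engine infers the dependency graph and orchestrates the stages. A sketch (the source path is hypothetical; the dlt module is available inside a pipeline run):

import dlt

@dlt.table(comment="Bronze: raw events as ingested")
def events_bronze():
    return (spark.readStream
                 .format("cloudFiles")
                 .option("cloudFiles.format", "json")
                 .load("/path/to/raw/events"))  # hypothetical source path

@dlt.table(comment="Silver: cleaned events")
def events_silver():
    # dlt.read_stream declares the dependency; the engine runs bronze first
    return dlt.read_stream("events_bronze").where("event_id IS NOT NULL")

With plain Structured Streaming, each of these would be a separately started query, and sequencing, retries, and cross-stage dependencies would fall to an external orchestrator.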

A data engineer is attempting to execute the following PySpark code:

from pyspark.sql.functions import sum

df = spark.read.table("sales")

result = df.groupBy("region").agg(sum("revenue"))

However, upon inspecting the execution plan and profiling the Spark job, they observe excessive data shuffling during the aggregation phase.

Which technique should be applied to reduce shuffling during the groupBy aggregation operation?

A.

Caching the DataFrame df.

B.

Repartition by region before aggregation.

C.

Use coalesce() after the aggregation.

D.

Use broadcast join.
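
A sketch of the repartition-by-key approach from option B, co-locating rows for the same region before the aggregation (table and column names follow the question):

from pyspark.sql import functions as F

df = spark.read.table("sales")

result = (df.repartition("region")        # hash-partition rows by the grouping key
            .groupBy("region")
            .agg(F.sum("revenue").alias("total_revenue")))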

A security analytics pipeline must enrich billions of raw connection logs with geolocation data. The join hinges on finding which IPv4 range each event’s address falls into.

Table 1: network_events (≈ 5 billion rows)

event_id ip_int

42 3232235777

Table 2: ip_ranges (≈ 2 million rows)

start_ip_int end_ip_int country

3232235520 3232236031 US

The query is currently very slow:

SELECT n.event_id, n.ip_int, r.country

FROM network_events n

JOIN ip_ranges r

ON n.ip_int BETWEEN r.start_ip_int AND r.end_ip_int;

Question:

Which change will most dramatically accelerate the query while preserving its logic?

A.

Increase spark.sql.shuffle.partitions from 200 to 10000.

B.

Add a range-join hint /*+ RANGE_JOIN(r, 65536) */.

C.

Force a sort-merge join with /*+ MERGE(r) */.

D.

Add a broadcast hint: /*+ BROADCAST(r) */ for ip_ranges.
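
For reference, Databricks supports a range-join hint that bins one side of a BETWEEN join so rows are only compared within overlapping bins. A sketch applying the hint from option B to the query above (the bin size of 65536 comes from the option; in practice it should be tuned to the typical width of the IP ranges):

enriched = spark.sql("""
    SELECT /*+ RANGE_JOIN(r, 65536) */
           n.event_id, n.ip_int, r.country
    FROM network_events n
    JOIN ip_ranges r
      ON n.ip_int BETWEEN r.start_ip_int AND r.end_ip_int
""")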

A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize and Auto-Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?

A.

Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.

B.

Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.

C.

Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.

D.

Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.

E.

Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to parquet.
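
A sketch of the input-split approach from option A (paths are hypothetical; final part-file sizes also depend on compression and the JSON-to-Parquet size ratio): reading roughly 1 TB of JSON with 512 MB splits yields about 2,048 input partitions, and narrow transformations preserve that partitioning, so the write produces similarly sized part files without a shuffle:

from pyspark.sql import functions as F

spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))

df = spark.read.json("/path/to/raw_json")                       # hypothetical source path
transformed = df.withColumn("ingest_date", F.current_date())    # example narrow transformation
transformed.write.mode("overwrite").parquet("/path/to/output")  # hypothetical target path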

A data engineer has configured their Databricks Asset Bundle with multiple targets in databricks.yml and deployed it to the production workspace. Now, to validate the deployment, they need to invoke a job named my_project_job specifically within the prod target context. Assuming the job is already deployed, they need to trigger its execution while ensuring the target-specific configuration is respected.

Which command will trigger the job execution?

A.

databricks execute my_project_job -e prod

B.

databricks job run my_project_job --env prod

C.

databricks run my_project_job -t prod

D.

databricks bundle run my_project_job -t prod
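
For context, a typical sequence when working against a specific bundle target looks like the following (a sketch; the job key and target name come from the question, and the commands are the standard Databricks CLI bundle commands):

databricks bundle validate -t prod       # optional: check the bundle configuration for the prod target
databricks bundle run my_project_job -t prod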