Databricks Databricks-Certified-Professional-Data-Engineer Free Certification Exam Questions Answer Jul 2025 update

Question # 31

What is a method of installing a Python package scoped at the notebook level to all nodes in the currently active cluster?

Use &Pip install in a notebook cell

Run source env/bin/activate in a notebook setup script

Install libraries from PyPi using the cluster UI

Use &sh install in a notebook cell

Question # 32

Although the Databricks Utilities Secrets module provides tools to store sensitive credentials and avoid accidentally displaying them in plain text users should still be careful with which credentials are stored here and which users have access to using these secrets.

Which statement describes a limitation of Databricks Secrets?

Because the SHA256 hash is used to obfuscate stored secrets, reversing this hash will display the value in plain text.

Account administrators can see all secrets in plain text by logging on to the Databricks Accounts console.

Secrets are stored in an administrators-only table within the Hive Metastore; database administrators have permission to query this table by default.

Iterating through a stored secret and printing each character will display secret contents in plain text.

The Databricks REST API can be used to list secrets in plain text if the personal access token has proper credentials.

Question # 33

The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table.

The following logic is used to process these records.

MERGE INTO customers

USING (

SELECT updates.customer_id as merge_ey, updates .*

FROM updates

UNION ALL

SELECT NULL as merge_key, updates .*

FROM updates JOIN customers

ON updates.customer_id = customers.customer_id

WHERE customers.current = true AND updates.address <> customers.address

) staged_updates

ON customers.customer_id = mergekey

WHEN MATCHED AND customers. current = true AND customers.address <> staged_updates.address THEN

UPDATE SET current = false, end_date = staged_updates.effective_date

WHEN NOT MATCHED THEN

INSERT (customer_id, address, current, effective_date, end_date)

VALUES (staged_updates.customer_id, staged_updates.address, true, staged_updates.effective_date, null)

Which statement describes this implementation?

The customers table is implemented as a Type 2 table; old values are overwritten and new customers are appended.

The customers table is implemented as a Type 1 table; old values are overwritten by new values and no history is maintained.

The customers table is implemented as a Type 2 table; old values are maintained but marked as no longer current and new values are inserted.

The customers table is implemented as a Type 0 table; all writes are append only with no changes to existing values.

Question # 34

A Delta Lake table representing metadata about content posts from users has the following schema:

user_id LONG

post_text STRING

post_id STRING

longitude FLOAT

latitude FLOAT

post_time TIMESTAMP

date DATE

Based on the above schema, which column is a good candidate for partitioning the Delta Table?

date

user_id

post_id

post_time

Explanation:

Partitioning a Delta Lake table is a strategy used to improve query performance by dividing the table into distinct segments based on the values of a specific column. This approach allows queries to scan only the relevant partitions, thereby reducing the amount of data read and enhancing performance.

Considerations for Choosing a Partition Column:

Cardinality:Columns with high cardinality (i.e., a large number of unique values) are generally poor choices for partitioning. High cardinality can lead to a large number of small partitions, which can degrade performance.

Query Patterns:The partition column should align with common query filters. If queries frequently filter data based on a particular column, partitioning by that column can be beneficial.

Partition Size:Each partition should ideally contain at least 1 GB of data. This ensures that partitions are neither too small (leading to too many partitions) nor too large (negating the benefits of partitioning).

Evaluation of Columns:

date:

Cardinality:Typically low, especially if data spans over days, months, or years.

Query Patterns:Many analytical queries filter data based on date ranges.

Partition Size:Likely to meet the 1 GB threshold per partition, depending on data volume.

user_id:

Cardinality:High, as each user has a unique ID.

Query Patterns:While some queries might filter by user_id, the high cardinality makes it unsuitable for partitioning.

Partition Size:Partitions could be too small, leading to inefficiencies.

post_id:

Cardinality:Extremely high, with each post having a unique ID.

Query Patterns:Unlikely to be used for filtering large datasets.

Partition Size:Each partition would be very small, resulting in a large number of partitions.

post_time:

Cardinality:High, especially if it includes exact timestamps.

Query Patterns:Queries might filter by time, but the high cardinality poses challenges.

Partition Size:Similar to user_id, partitions could be too small.

Conclusion:

Given the considerations, the date column is the most suitable candidate for partitioning. It has low cardinality, aligns with common query patterns, and is likely to result in appropriately sized partitions.

References:

Delta Lake Best Practices

Partitioning in Delta Lake

Question # 35

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFramedf. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.

Streaming DataFramedfhas the following schema:

"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"

Code block:

Choose the response that correctly fills in the blank within the code block to complete this task.

to_interval("event_time", "5 minutes").alias("time")

window("event_time", "5 minutes").alias("time")

"event_time"

window("event_time", "10 minutes").alias("time")

lag("event_time", "10 minutes").alias("time")

Question # 36

Which statement describes the correct use of pyspark.sql.functions.broadcast?

It marks a column as having low enough cardinality to properly map distinct values to available partitions, allowing a broadcast join.

It marks a column as small enough to store in memory on all executors, allowing a broadcast join.

It caches a copy of the indicated table on attached storage volumes for all active clusters within a Databricks workspace.

It marks a DataFrame as small enough to store in memory on all executors, allowing a broadcast join.

It caches a copy of the indicated table on all nodes in the cluster for use in all future queries during the cluster lifetime.

Question # 37

A junior member of the data engineering team is exploring the language interoperability of Databricks notebooks. The intended outcome of the below code is to register a view of all sales that occurred in countries on the continent of Africa that appear in thegeo_lookuptable.

Before executing the code, runningSHOWTABLESon the current database indicates the database contains only two tables:geo_lookupandsales.

Which statement correctly describes the outcome of executing these command cells in order in an interactive notebook?

Both commands will succeed. Executing show tables will show that countries at and sales at have been registered as views.

Cmd 1 will succeed. Cmd 2 will search all accessible databases for a table or view named countries af: if this entity exists, Cmd 2 will succeed.

Cmd 1 will succeed and Cmd 2 will fail, countries at will be a Python variable representing a PySpark DataFrame.

Both commands will fail. No new variables, tables, or views will be created.

Cmd 1 will succeed and Cmd 2 will fail, countries at will be a Python variable containing a list of strings.

Question # 38

A user wants to use DLT expectations to validate that a derived table report contains all records from the source, included in the table validation_copy.

The user attempts and fails to accomplish this by adding an expectation to the report table definition.

Which approach would allow using DLT expectations to validate all expected records are present in this table?

Define a SQL UDF that performs a left outer join on two tables, and check if this returns null values for report key values in a DLT expectation for the report table.

Define a function that performs a left outer join on validation_copy and report and report, and check against the result in a DLT expectation for the report table

Define a temporary table that perform a left outer join on validation_copy and report, and define an expectation that no report key values are null

Define a view that performs a left outer join on validation_copy and report, and reference this view in DLT expectations for the report table

Weekend Sale - Limited Time 70% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: sntaclus

Free Practice Questions for Databricks Databricks-Certified-Professional-Data-Engineer Exam

The Answer Is:

Explanation:

The Answer Is:

Explanation:

The Answer Is:

Explanation:

The Answer Is:

Explanation:

The Answer Is:

Explanation:

The Answer Is:

Explanation:

The Answer Is:

Explanation:

The Answer Is:

Explanation: