Databricks Databricks-Certified-Professional-Data-Engineer Free Certification Exam Questions Answer Jul 2025 update

Question # 1

A data engineer is performing a join operating to combine values from a static userlookup table with a streaming DataFrame streamingDF.

Which code block attempts to perform an invalid stream-static join?

userLookup.join(streamingDF, ["userid"], how="inner")

streamingDF.join(userLookup, ["user_id"], how="outer")

streamingDF.join(userLookup, ["user_id”], how="left")

streamingDF.join(userLookup, ["userid"], how="inner")

userLookup.join(streamingDF, ["user_id"], how="right")

Question # 2

A Data engineer wants to run unit’s tests using common Python testing frameworks on python functions defined across several Databricks notebooks currently used in production.

How can the data engineer run unit tests against function that work with data in production?

Run unit tests against non-production data that closely mirrors production

Define and unit test functions using Files in Repos

Define units test and functions within the same notebook

Define and import unit test functions from a separate Databricks notebook

Question # 3

The following code has been migrated to a Databricks notebook from a legacy workload:

The code executes successfully and provides the logically correct results, however, it takes over 20 minutes to extract and load around 1 GB of data.

Which statement is a possible explanation for this behavior?

%sh triggers a cluster restart to collect and install Git. Most of the latency is related to cluster startup time.

Instead of cloning, the code should use %sh pip install so that the Python code can get executed in parallel across all nodes in a cluster.

%sh does not distribute file moving operations; the final line of code should be updated to use %fs instead.

Python will always execute slower than Scala on Databricks. The run.py script should be refactored to Scala.

%sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.

Question # 4

A table nameduser_ltvis being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.

Theuser_ltvtable has the following schema:

email STRING, age INT, ltv INT

The following view definition is executed:

An analyst who is not a member of the marketing group executes the following query:

SELECT * FROM email_ltv

Which statement describes the results returned by this query?

Three columns will be returned, but one column will be named "redacted" and contain only null values.

Only the email and itv columns will be returned; the email column will contain all null values.

The email and ltv columns will be returned with the values in user itv.

The email, age. and ltv columns will be returned with the values in user ltv.

Only the email and ltv columns will be returned; the email column will contain the string "REDACTED" in each row.

Question # 5

When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?

Cluster: New Job Cluster;

Retries: Unlimited;

Maximum Concurrent Runs: Unlimited

Cluster: New Job Cluster;

Retries: None;

Maximum Concurrent Runs: 1

Cluster: Existing All-Purpose Cluster;

Retries: Unlimited;

Maximum Concurrent Runs: 1

Cluster: New Job Cluster;

Retries: Unlimited;

Maximum Concurrent Runs: 1

Cluster: Existing All-Purpose Cluster;

Retries: None;

Maximum Concurrent Runs: 1

Question # 6

The data governance team is reviewing code used for deleting records for compliance with GDPR. They note the following logic is used to delete records from the Delta Lake table namedusers.

Assuming thatuser_idis a unique identifying key and thatdelete_requestscontains all users that have requested deletion, which statement describes whether successfully executing the above logic guarantees that the records to be deleted are no longer accessible and why?

Yes; Delta Lake ACID guarantees provide assurance that the delete command succeeded fully and permanently purged these records.

No; the Delta cache may return records from previous versions of the table until the cluster is restarted.

Yes; the Delta cache immediately updates to reflect the latest data files recorded to disk.

No; the Delta Lake delete command only provides ACID guarantees when combined with the merge into command.

No; files containing deleted records may still be accessible with time travel until a vacuum command is used to remove invalidated data files.

Question # 7

A junior data engineer has configured a workload that posts the following JSON to the Databricks REST API endpoint2.0/jobs/create.

Assuming that all configurations and referenced resources are available, which statement describes the result of executing this workload three times?

Three new jobs named "Ingest new data" will be defined in the workspace, and they will each run once daily.

The logic defined in the referenced notebook will be executed three times on new clusters with the configurations of the provided cluster ID.

Three new jobs named "Ingest new data" will be defined in the workspace, but no jobs will be executed.

One new job named "Ingest new data" will be defined in the workspace, but it will not be executed.

The logic defined in the referenced notebook will be executed three times on the referenced existing all purpose cluster.

Question # 8

A CHECK constraint has been successfully added to the Delta table named activity_details using the following logic:

A batch job is attempting to insert new records to the table, including a record where latitude = 45.50 and longitude = 212.67.

Which statement describes the outcome of this batch insert?

The write will fail when the violating record is reached; any records previously processed will be recorded to the target table.

The write will fail completely because of the constraint violation and no records will be inserted into the target table.

The write will insert all records except those that violate the table constraints; the violating records will be recorded to a quarantine table.

The write will include all records in the target table; any violations will be indicated in the boolean column named valid_coordinates.

The write will insert all records except those that violate the table constraints; the violating records will be reported in a warning log.

Question # 9

The data engineering team maintains a table of aggregate statistics through batch nightly updates. This includes total sales for the previous day alongside totals and averages for a variety of time periods including the 7 previous days, year-to-date, and quarter-to-date. This table is namedstore_saies_summaryand the schema is as follows:

The tabledaily_store_salescontains all the information needed to updatestore_sales_summary. The schema for this table is:

store_id INT, sales_date DATE, total_sales FLOAT

Ifdaily_store_salesis implemented as a Type 1 table and thetotal_salescolumn might be adjusted after manual data auditing, which approach is the safest to generate accurate reports in thestore_sales_summarytable?

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and overwrite the store_sales_summary table with each Update.

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and append new rows nightly to the store_sales_summary table.

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.

Implement the appropriate aggregate logic as a Structured Streaming read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.

Use Structured Streaming to subscribe to the change data feed for daily_store_sales and apply changes to the aggregates in the store_sales_summary table with each update.

Explanation:

The daily_store_sales table contains all the information needed to update store_sales_summary. The schema of the table is:

store_id INT, sales_date DATE, total_sales FLOAT

The daily_store_sales table is implemented as a Type 1 table, which means that old values are overwritten by new values and no history is maintained. The total_sales column might be adjusted after manual data auditing, which means that the data in the table may change over time.

The safest approach to generate accurate reports in the store_sales_summary table is to use Structured Streaming to subscribe to the change data feed for daily_store_sales and apply changes to the aggregates in the store_sales_summary table with each update. Structured Streaming is a scalable and fault-tolerant stream processing engine built on Spark SQL. Structured Streaming allows processing data streams as if they were tables or DataFrames, using familiar operations such as select, filter, groupBy, or join. Structured Streaming also supports output modes that specify how to write the results of a streaming query to a sink, such as append, update, or complete. Structured Streaming can handle both streaming and batch data sources in a unified manner.

The change data feed is a feature of Delta Lake that provides structured streaming sources that can subscribe to changes made to a Delta Lake table. The change data feed captures both data changes and schema changes as ordered events that can be processed by downstream applications or services. The change data feed can be configured with different options, such as starting from a specific version or timestamp, filtering by operation type or partition values, or excluding no-op changes.

By using Structured Streaming to subscribe to the change data feed for daily_store_sales, one can capture and process any changes made to the total_sales column due to manual data auditing. By applying these changes to the aggregates in the store_sales_summary table with each update, one can ensure that the reports are always consistent and accurate with the latest data. Verified References: [Databricks Certified Data Engineer Professional], under “Spark Core” section; Databricks Documentation, under “Structured Streaming” section; Databricks Documentation, under “Delta Change Data Feed” section.

Question # 10

A data engineer wants to create a cluster using the Databricks CLI for a big ETL pipeline. The cluster should havefive workers,one driverof type i3.xlarge, and should use the '14.3.x-scala2.12' runtime.

Which command should the data engineer use?

databricks clusters create 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name DataEngineer_cluster

databricks clusters add 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name Data Engineer_cluster

databricks compute add 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name Data Engineer_cluster

databricks compute create 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name Data Engineer_cluster

Weekend Sale - Limited Time 70% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: sntaclus

Free Practice Questions for Databricks Databricks-Certified-Professional-Data-Engineer Exam

The Answer Is:

Explanation:

The Answer Is:

Explanation:

The Answer Is:

Explanation:

The Answer Is:

Explanation:

The Answer Is:

Explanation:

The Answer Is:

Explanation:

The Answer Is:

Explanation:

The Answer Is:

Explanation:

The Answer Is:

Explanation:

The Answer Is:

Explanation: