
A platform team is creating a standardized template for Databricks Asset Bundles to support CI/CD. The template must specify defaults for artifacts, workspace root paths, and a run identity, while allowing a “dev” target to be the default and override specific paths.

How should the team use databricks.yml to satisfy these requirements?

A.

Use deployment, builds, context, identity, and environments; set dev as default environment and override paths under builds.

B.

Use roots, modules, profiles, actor, and targets; where profiles contain workspace and artifacts defaults and actor sets run identity.

C.

Use project, packages, environment, identity, and stages; set dev as default stage and override workspace under environment.

D.

Use bundle, artifacts, workspace, run_as, and targets at the top level; set one target with default: true and override workspace paths or artifacts under that target.
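For reference, a minimal databricks.yml sketch of the structure option D describes, with one target marked as the default; the bundle name, paths, and service principal below are hypothetical placeholders:

bundle:
  name: etl-template

artifacts:
  default:
    type: whl
    path: .

workspace:
  root_path: /Workspace/Shared/.bundle/${bundle.name}/${bundle.target}

run_as:
  service_principal_name: "etl-ci-sp"

targets:
  dev:
    default: true
    workspace:
      root_path: /Workspace/Users/someone@example.com/.bundle/${bundle.name}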

A data engineer is building a pipeline with Lakeflow Declarative Pipelines to process healthcare claims data. A metadata JSON file defines data quality rules for multiple tables, including:

{
  "claims": [
    {"name": "valid_patient_id", "constraint": "patient_id IS NOT NULL"},
    {"name": "non_negative_amount", "constraint": "claim_amount >= 0"}
  ]
}

The pipeline must dynamically apply these rules to the claims table without hardcoding the rules.

How should the data engineer achieve this?

A.

Load the JSON metadata, loop through its entries, and apply expectations using dlt.expect_all.

B.

Invoke an external API to validate records against the metadata rules.

C.

Reference each expectation with @dlt.expect decorators in the table declaration.

D.

Use a SQL CONSTRAINT block referencing the JSON file path.
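A minimal Python sketch of option A's approach: load the metadata, build a {name: constraint} mapping, and pass it to dlt.expect_all. The file path and source table name are hypothetical; in a Lakeflow Declarative Pipelines notebook, spark is available implicitly.

import json
import dlt

RULES_PATH = "/Volumes/main/metadata/quality_rules.json"  # hypothetical location

def load_rules(table_name):
    # Build {rule_name: constraint_expression} for one table from the metadata file.
    with open(RULES_PATH) as f:
        rules = json.load(f)
    return {r["name"]: r["constraint"] for r in rules.get(table_name, [])}

@dlt.table(name="claims")
@dlt.expect_all(load_rules("claims"))  # rules applied dynamically, not hardcoded
def claims():
    return spark.read.table("claims_raw")  # hypothetical source table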

A data engineering team is configuring environments for development, testing, and production before beginning migration of a new data pipeline. The team requires extensive testing of both the code and the data resulting from code execution, and wants to develop and test against data as similar to production as possible.

A junior data engineer suggests that production data can be mounted to the development and testing environments, allowing pre-production code to execute against production data. Because all users have admin privileges in the development environment, the junior data engineer has offered to configure permissions and mount this data for the team.

Which statement captures best practices for this situation?

A.

Because access to production data will always be verified using passthrough credentials, it is safe to mount data to any Databricks development environment.

B.

All development, testing, and production code and data should exist in a single unified workspace; creating separate environments for testing and development further reduces risks.

C.

In environments where interactive code will be executed, production data should only be accessible with read permissions; creating isolated databases for each environment further reduces risks.

D.

Because Delta Lake versions all data and supports time travel, it is not possible for user error or malicious actors to permanently delete production data; as such, it is generally safe to mount production data anywhere.
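A sketch of the pattern described in option C, assuming Unity Catalog and hypothetical catalog, schema, and group names: developers can read production data but can only write in their own isolated database.

spark.sql("GRANT SELECT ON SCHEMA prod.claims TO `developers`")         # read-only access to production data
spark.sql("CREATE SCHEMA IF NOT EXISTS dev.claims")                     # isolated database for development
spark.sql("GRANT ALL PRIVILEGES ON SCHEMA dev.claims TO `developers`")  # full rights only in dev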

A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that the Min, Median, and Max task durations for a particular stage show the minimum and median times to complete a task as roughly the same, but the maximum duration as roughly 100 times as long as the minimum.

Which situation is causing increased duration of the overall job?

A.

Task queueing resulting from improper thread pool assignment.

B.

Spill resulting from attached volume storage being too small.

C.

Network latency due to some cluster nodes being in different regions from the source data.

D.

Skew caused by more data being assigned to a subset of Spark partitions.

E.

Credential validation errors while pulling data from an external system.
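If the Spark UI confirms skew (option D), one standard mitigation is adaptive query execution's skew-join handling, which splits oversized partitions at join time; a minimal sketch using standard Spark configuration keys:

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")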

Which method can be used to determine the total wall-clock time it took to execute a query?

A.

In the Spark UI, take the job duration of the longest-running job associated with that query.

B.

In the Spark UI, take the sum of all task durations that ran across all stages for all jobs associated with that query.

C.

Open the Query Profiler associated with that query and use the Total wall-clock duration metric.

D.

Open the Query Profiler associated with that query and use the Aggregated task time metric.
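As a rough cross-check, wall-clock time can also be measured from the client side, though this includes client and scheduling overhead that the Query Profiler metric does not; the query below is a hypothetical example:

import time

start = time.perf_counter()
spark.sql("SELECT count(*) FROM claims").collect()  # hypothetical query
print(f"wall-clock seconds: {time.perf_counter() - start:.2f}")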

Which statement describes integration testing?

A.

Validates interactions between subsystems of your application

B.

Requires an automated testing framework

C.

Requires manual intervention

D.

Validates an application use case

E.

Validates behavior of individual elements of your application
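For contrast with unit tests, a minimal pytest-style sketch of an integration test that validates the interaction between two subsystems; the mypipeline modules are hypothetical:

from mypipeline import ingest, transform  # hypothetical subsystems under test

def test_ingest_feeds_transform(tmp_path):
    # Validate that the two subsystems work together, not just in isolation:
    # transform must accept exactly what ingest produces.
    raw = ingest.read_raw(str(tmp_path))
    result = transform.clean(raw)
    assert result.count() >= 0  # pipeline ran end to end without errors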

A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.

The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours.

Which approach would simplify the identification of these changed records?

A.

Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.

B.

Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.

C.

Calculate the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers before making new predictions; only make predictions on those customers not in the previous predictions.

D.

Modify the overwrite logic to include a field populated by calling pyspark.sql.functions.current_timestamp() as data are being written; use this field to identify records written on a particular date.

E.

Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.
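A sketch of option E's incremental read, assuming the change data feed has been enabled on the table (delta.enableChangeDataFeed = true); the cutoff timestamp is a hypothetical placeholder:

changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingTimestamp", "2024-01-01 00:00:00")  # last 24 hours in practice
    .table("customer_churn_params")
)

# Keep only current row versions; drop pre-update images.
changed_rows = changes.filter("_change_type IN ('insert', 'update_postimage')")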

A data engineer wants to enforce the principle of least privilege when configuring ACLs for Databricks jobs in a collaborative workspace.

Which approach should the data engineer use?

A.

Grant CAN RUN permission to everyone and CAN MANAGE to a single admin group.

B.

Use only folder-level permissions and avoid setting permissions on individual jobs.

C.

Grant all users CAN MANAGE permission on all jobs to avoid access issues.

D.

Assign users only the minimum permission level (e.g., CAN RUN or CAN VIEW) required for their role on each job.
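A hedged sketch of option D using the Databricks Permissions REST API (PATCH adds entries without replacing existing ones); the host, token, job ID, and group name are hypothetical placeholders:

import requests

resp = requests.patch(
    "https://<workspace-host>/api/2.0/permissions/jobs/123",
    headers={"Authorization": "Bearer <token>"},
    json={"access_control_list": [
        # Analysts only need to see run results, so they get the minimum level.
        {"group_name": "analysts", "permission_level": "CAN_VIEW"},
    ]},
)
resp.raise_for_status()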

Which of the following technologies can be used to identify key areas of text when parsing Spark Driver log4j output?

A.

Regex

B.

Julia

C.

pyspark.ml.feature

D.

Scala Datasets

E.

C++
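A minimal regex sketch (option A) for pulling the level, logger, and message out of a driver log4j line; the sample line is hypothetical:

import re

line = "24/05/01 12:00:00 ERROR TaskSchedulerImpl: Lost executor 3 on 10.0.0.5"
m = re.search(r"^\S+\s+\S+\s+(ERROR|WARN|INFO)\s+([\w.]+): (.*)$", line)
if m:
    level, logger, message = m.groups()
    print(level, logger, message)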

A data governance team at a large enterprise is improving data discoverability across its organization. The team has hundreds of tables in their Databricks Lakehouse with thousands of columns that lack proper documentation. Many of these tables were created by different teams over several years, with missing context about column meanings and business logic. The data governance team needs to quickly generate comprehensive column descriptions for all existing tables to meet compliance requirements and improve data literacy across the organization. They want to leverage modern capabilities to automatically generate meaningful descriptions rather than manually documenting each column, which would take months to complete.

Which approach should the team use in Databricks to automatically generate column comments and descriptions for existing tables?

A.

Navigate to the table in Databricks Catalog Explorer, select the table schema view, and use the AI Generate option which leverages artificial intelligence to automatically create meaningful column descriptions based on column names, data types, sample values, and data patterns.

B.

Use Delta Lake’s DESCRIBE HISTORY command to analyze table evolution and infer column purposes from historical changes.

C.

Use the DESCRIBE TABLE command to extract existing schema information and manually write descriptions based on column names and data types.

D.

Write custom PySpark code using df.describe() and df.schema to programmatically generate basic statistical descriptions for each column.
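Whether generated by the AI Generate option or written by hand, descriptions are ultimately stored as ordinary column comments; a sketch of applying one programmatically, with a hypothetical table and column:

spark.sql(
    "ALTER TABLE main.sales.orders "
    "ALTER COLUMN customer_id COMMENT 'Unique identifier for the customer placing the order'"
)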