Given the following error traceback (from display(df.select(3 * "heartrate"))) which shows AnalysisException: cannot resolve 'heartrateheartrateheartrate', which statement describes the error being raised?

A.

There is a type error because a DataFrame object cannot be multiplied.

B.

There is a syntax error because the heartrate column is not correctly identified as a column.

C.

There is no column in the table named heartrateheartrateheartrate.

D.

There is a type error because a column object cannot be multiplied.
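
In Python, multiplying a string by an integer repeats the string, so the expression 3 * "heartrate" evaluates to "heartrateheartrateheartrate" before Spark ever parses the query. A minimal sketch of the failing expression and the column-object alternative (assuming a DataFrame df with a heartrate column, as in the question):

from pyspark.sql import functions as F

# 3 * "heartrate" is plain Python string repetition, so Spark is asked to
# resolve a column literally named "heartrateheartrateheartrate"
# display(df.select(3 * "heartrate"))  # raises AnalysisException

# Multiplying the column object scales the column's values instead
display(df.select(3 * F.col("heartrate")))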

A data engineer has created a transactions Delta table on Databricks that should be used by the analytics team. The analytics team wants to use the table with another tool that requires Apache Iceberg format.

What should the data engineer do?

A.

Require the analytics team to use a tool that supports Delta table.

B.

Enable UniForm on the transactions table with 'iceberg' so that the table can be read as an Iceberg table.

C.

Create an Iceberg copy of the transactions Delta table which can be used by the analytics team.

D.

Convert the transactions Delta table to Iceberg and enable UniForm so that the table can be read as a Delta table.
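
For context, UniForm is enabled through Delta table properties rather than by copying data. A hedged sketch of enabling Iceberg reads on the existing table (property names as documented for recent Databricks runtimes; verify against your runtime version):

spark.sql("""
    ALTER TABLE transactions SET TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")

After this, Delta remains the source of truth while Iceberg metadata is generated alongside the table, so Iceberg-based tools can read it in place.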

A junior developer complains that the code in their notebook isn't producing the correct results in the development environment. A shared screenshot reveals that while they're using a notebook versioned with Databricks Repos, they're using a personal branch that contains old logic. The desired branch named dev-2.3.9 is not available from the branch selection dropdown.

Which approach will allow this developer to review the current logic for this notebook?

A.

Use Repos to make a pull request, then use the Databricks REST API to update the current branch to dev-2.3.9

B.

Use Repos to pull changes from the remote Git repository and select the dev-2.3.9 branch.

C.

Use Repos to checkout the dev-2.3.9 branch and auto-resolve conflicts with the current branch

D.

Merge all changes back to the main branch in the remote Git repository and clone the repo again

E.

Use Repos to merge the current branch and the dev-2.3.9 branch, then make a pull request to sync with the remote repository

A Databricks SQL dashboard has been configured to monitor the total number of records present in a collection of Delta Lake tables using the following query pattern:

SELECT COUNT(*) FROM table -

Which of the following describes how results are generated each time the dashboard is updated?

A.

The total count of rows is calculated by scanning all data files

B.

The total count of rows will be returned from cached results unless REFRESH is run

C.

The total count of records is calculated from the Delta transaction logs

D.

The total count of records is calculated from the parquet file metadata

E.

The total count of records is calculated from the Hive metastore
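
Delta Lake records per-file statistics, including numRecords, in its transaction log, which is what allows a bare COUNT(*) to be answered without scanning the data files. A rough, illustrative sketch of where those statistics live (hypothetical table path; an exact count would also have to replay remove actions and parquet checkpoints, so this is for illustration only):

from pyspark.sql import functions as F

log = spark.read.json("/path/to/table/_delta_log/*.json")  # hypothetical path
(log.filter("add IS NOT NULL")
    .select(F.get_json_object("add.stats", "$.numRecords").cast("long").alias("num_records"))
    .agg(F.sum("num_records").alias("rows_added"))
    .show())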

How are the operational aspects of Lakeflow Declarative Pipelines different from Spark Structured Streaming?

A.

Lakeflow Declarative Pipelines manage the orchestration of multi-stage pipelines automatically, while Structured Streaming requires external orchestration for complex dependencies.

B.

Structured Streaming can process continuous data streams, while Lakeflow Declarative Pipelines cannot.

C.

Lakeflow Declarative Pipelines can write to Delta Lake format, while Structured Streaming cannot.

D.

Lakeflow Declarative Pipelines automatically handle schema evolution, while Structured Streaming always requires manual schema management.
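
The operational contrast is easiest to see in a minimal declarative pipeline: each table is declared as a function, and the pipeline engine infers the dependency graph and orchestrates the stages. A sketch (the source path is hypothetical; the dlt module is available inside a pipeline run):

import dlt

@dlt.table(comment="Bronze: raw events as ingested")
def events_bronze():
    return (spark.readStream
                 .format("cloudFiles")
                 .option("cloudFiles.format", "json")
                 .load("/path/to/raw/events"))  # hypothetical source path

@dlt.table(comment="Silver: cleaned events")
def events_silver():
    # dlt.read_stream declares the dependency; the engine runs bronze first
    return dlt.read_stream("events_bronze").where("event_id IS NOT NULL")

With plain Structured Streaming, each of these would be a separately started query, and sequencing, retries, and cross-stage dependencies would fall to an external orchestrator.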

A data engineer is attempting to execute the following PySpark code:

from pyspark.sql.functions import sum

df = spark.read.table("sales")

result = df.groupBy("region").agg(sum("revenue"))

However, upon inspecting the execution plan and profiling the Spark job, they observe excessive data shuffling during the aggregation phase.

Which technique should be applied to reduce shuffling during the groupBy aggregation operation?

A.

Caching the DataFrame df.

B.

Repartition by region before aggregation.

C.

Use coalesce() after the aggregation.

D.

Use broadcast join.
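
A sketch of the repartition-by-key approach from option B, co-locating rows for the same region before the aggregation (table and column names follow the question):

from pyspark.sql import functions as F

df = spark.read.table("sales")

result = (df.repartition("region")        # hash-partition rows by the grouping key
            .groupBy("region")
            .agg(F.sum("revenue").alias("total_revenue")))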

A security analytics pipeline must enrich billions of raw connection logs with geolocation data. The join hinges on finding which IPv4 range each event’s address falls into.

Table 1: network_events (≈ 5 billion rows)

event_id ip_int

42 3232235777

Table 2: ip_ranges (≈ 2 million rows)

start_ip_int end_ip_int country

3232235520 3232236031 US

The query is currently very slow:

SELECT n.event_id, n.ip_int, r.country

FROM network_events n

JOIN ip_ranges r

ON n.ip_int BETWEEN r.start_ip_int AND r.end_ip_int;

Question:

Which change will most dramatically accelerate the query while preserving its logic?

A.

Increase spark.sql.shuffle.partitions from 200 to 10000.

B.

Add a range-join hint /*+ RANGE_JOIN(r, 65536) */.

C.

Force a sort-merge join with /*+ MERGE(r) */.

D.

Add a broadcast hint: /*+ BROADCAST(r) */ for ip_ranges.
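
For reference, Databricks supports a range-join hint that bins one side of a BETWEEN join so rows are only compared within overlapping bins. A sketch applying the hint from option B to the query above (the bin size of 65536 comes from the option; in practice it should be tuned to the typical width of the IP ranges):

enriched = spark.sql("""
    SELECT /*+ RANGE_JOIN(r, 65536) */
           n.event_id, n.ip_int, r.country
    FROM network_events n
    JOIN ip_ranges r
      ON n.ip_int BETWEEN r.start_ip_int AND r.end_ip_int
""")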

A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize and Auto-Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?

A.

Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.

B.

Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.

C.

Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.

D.

Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.

E.

Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to parquet.
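
A sketch of the input-split approach from option A (paths are hypothetical; final part-file sizes also depend on compression and the JSON-to-Parquet size ratio): reading roughly 1 TB of JSON with 512 MB splits yields about 2,048 input partitions, and narrow transformations preserve that partitioning, so the write produces similarly sized part files without a shuffle:

from pyspark.sql import functions as F

spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))

df = spark.read.json("/path/to/raw_json")                       # hypothetical source path
transformed = df.withColumn("ingest_date", F.current_date())    # example narrow transformation
transformed.write.mode("overwrite").parquet("/path/to/output")  # hypothetical target path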

A data engineer has configured their Databricks Asset Bundle with multiple targets in databricks.yml and deployed it to the production workspace. Now, to validate the deployment, they need to invoke a job named my_project_job specifically within the prod target context. Assuming the job is already deployed, they need to trigger its execution while ensuring the target-specific configuration is respected.

Which command will trigger the job execution?

A.

databricks execute my_project_job -e prod

B.

databricks job run my_project_job --env prod

C.

databricks run my_project_job -t prod

D.

databricks bundle run my_project_job -t prod
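
For context, a typical sequence when working against a specific bundle target looks like the following (a sketch; the job key and target name come from the question, and the commands are the standard Databricks CLI bundle commands):

databricks bundle validate -t prod       # optional: check the bundle configuration for the prod target
databricks bundle run my_project_job -t prod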