
The data engineering team is migrating an enterprise system with thousands of tables and views into the Lakehouse. They plan to implement the target architecture using a series of bronze, silver, and gold tables. Bronze tables will almost exclusively be used by production data engineering workloads, while silver tables will be used to support both data engineering and machine learning workloads. Gold tables will largely serve business intelligence and reporting purposes. While personally identifiable information (PII) exists in all tiers of data, pseudonymization and anonymization rules are in place for all data at the silver and gold levels.

The organization is interested in reducing security concerns while maximizing the ability to collaborate across diverse teams.

Which statement exemplifies best practices for implementing this system?

A.

Isolating tables in separate databases based on data quality tiers allows for easy permissions management through database ACLs and allows physical separation of default storage locations for managed tables.

B.

Because databases on Databricks are merely a logical construct, choices around database organization do not impact security or discoverability in the Lakehouse.

C.

Storing all production tables in a single database provides a unified view of all data assets available throughout the Lakehouse, simplifying discoverability by granting all users view privileges on this database.

D.

Working in the default Databricks database provides the greatest security when working with managed tables, as these will be created in the DBFS root.

E.

Because all tables must live in the same storage containers used for the database they're created in, organizations should be prepared to create between dozens and thousands of databases depending on their data isolation requirements.
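As a hedged illustration of the pattern described in option A (the database names, storage paths, and group names below are assumptions, not part of the question), separate databases per quality tier can be created with explicit storage locations and then governed with database-level grants:

    # Hypothetical sketch: one database per quality tier, each with its own
    # default storage location for managed tables. Names and paths are assumptions.
    spark.sql("CREATE DATABASE IF NOT EXISTS bronze_db LOCATION 'dbfs:/mnt/lakehouse/bronze'")
    spark.sql("CREATE DATABASE IF NOT EXISTS silver_db LOCATION 'dbfs:/mnt/lakehouse/silver'")
    spark.sql("CREATE DATABASE IF NOT EXISTS gold_db   LOCATION 'dbfs:/mnt/lakehouse/gold'")

    # With table access control enabled, permissions can then be managed once per
    # database rather than per table, e.g. read access for ML and BI groups.
    spark.sql("GRANT USAGE, SELECT ON DATABASE silver_db TO `ml-engineers`")
    spark.sql("GRANT USAGE, SELECT ON DATABASE gold_db TO `bi-analysts`")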

A junior data engineer on your team has implemented the following code block.

The view new_events contains a batch of records with the same schema as the events Delta table. The event_id field serves as a unique key for this table.

When this query is executed, what will happen with new records that have the same event_id as an existing record?

A.

They are merged.

B.

They are ignored.

C.

They are updated.

D.

They are inserted.

E.

They are deleted.
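The code block referenced above is not reproduced in this dump. Purely as a hypothetical sketch of how the answer depends on the merge clauses used (only the events table, new_events view, and event_id key are taken from the question; the clauses themselves are assumptions, not the original code):

    # Hypothetical sketch only -- not the junior engineer's actual code block.
    # An insert-only merge keyed on event_id: rows whose event_id already exists
    # in events are ignored, and only new event_ids are inserted.
    spark.sql("""
        MERGE INTO events e
        USING new_events n
        ON e.event_id = n.event_id
        WHEN NOT MATCHED THEN INSERT *
    """)
    # Adding "WHEN MATCHED THEN UPDATE SET *" would instead update matching rows,
    # and "WHEN MATCHED THEN DELETE" would delete them.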

In order to facilitate near real-time workloads, a data engineer is creating a helper function to leverage the schema detection and evolution functionality of Databricks Auto Loader. The desired function will automatically detect the schema of the source directory, incrementally process JSON files as they arrive in a source directory, and automatically evolve the schema of the table when new fields are detected.

The function is displayed below with a blank:

Which response correctly fills in the blank to meet the specified requirements?

A.

Option A

B.

Option B

C.

Option C

D.

Option D

E.

Option E
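For reference, a minimal sketch of a helper function meeting the stated requirements, assuming a JSON source directory and a shared checkpoint/schema location (the function and parameter names are illustrative, not the exam's Option A-E code):

    # Hypothetical sketch of an Auto Loader helper; names and options are assumptions.
    def autoload_json_to_table(source_dir, table_name, checkpoint_path):
        return (spark.readStream
                     .format("cloudFiles")                               # Auto Loader source
                     .option("cloudFiles.format", "json")                # incremental JSON ingestion
                     .option("cloudFiles.schemaLocation", checkpoint_path)  # schema inference and tracking
                     .load(source_dir)
                     .writeStream
                     .option("checkpointLocation", checkpoint_path)
                     .option("mergeSchema", "true")                      # evolve the table schema on new fields
                     .toTable(table_name))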

The following code has been migrated to a Databricks notebook from a legacy workload:

The code executes successfully and provides logically correct results; however, it takes over 20 minutes to extract and load around 1 GB of data.

Which statement is a possible explanation for this behavior?

A.

%sh triggers a cluster restart to collect and install Git. Most of the latency is related to cluster startup time.

B.

Instead of cloning, the code should use %sh pip install so that the Python code can get executed in parallel across all nodes in a cluster.

C.

%sh does not distribute file moving operations; the final line of code should be updated to use %fs instead.

D.

Python will always execute slower than Scala on Databricks. The run.py script should be refactored to Scala.

E.

%sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.
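The migrated code itself is not shown in this dump. As a hedged illustration of the point made in option E (source and target paths are assumptions), expressing the extract-and-load step as a Spark job distributes the work across the executors instead of running it as a shell process on the driver:

    # Hypothetical sketch: the same ~1 GB extract-and-load expressed with Spark,
    # so reads and writes run in parallel on the worker nodes. Paths are assumptions.
    df = spark.read.json("dbfs:/mnt/raw/legacy_extract/")   # distributed read
    (df.write
       .format("delta")
       .mode("append")
       .save("dbfs:/mnt/bronze/legacy_extract/"))           # distributed write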

A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.

Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?

A.

Scala is the only language that can be accurately tested using interactive notebooks; because the best performance is achieved by using Scala code compiled to JARs, all PySpark and Spark SQL logic should be refactored.

B.

The only way to meaningfully troubleshoot code execution times in development notebooks is to use production-sized data and production-sized clusters with Run All execution.

C.

Production code development should only be done using an IDE; executing code against a local build of open source Spark and Delta Lake will provide the most accurate benchmarks for how code will perform in production.

D.

Calling display() forces a job to trigger, while many transformations will only add to the logical query plan; because of caching, repeated execution of the same logic does not provide meaningful results.

E.

The Jobs UI should be leveraged to occasionally run the notebook as a job and track execution time during incremental code development because Photon can only be enabled on clusters launched for scheduled jobs.
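As a brief, hedged illustration of the lazy-evaluation behavior described in option D (the table and column names are assumptions):

    # Hypothetical sketch: transformations only build the logical query plan; no
    # cluster work happens until an action such as display(), count(), or a write.
    transformed = (spark.read.table("raw_events")        # lazy
                        .filter("status = 'active'")     # lazy
                        .groupBy("device_id").count())   # lazy

    transformed.count()   # action: a Spark job actually runs here
    transformed.count()   # a repeat may reuse cached data or shuffle files, so
                          # interactive re-runs can time much faster than a
                          # cold production run of the same logic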

Which statement characterizes the general programming model used by Spark Structured Streaming?

A.

Structured Streaming leverages the parallel processing of GPUs to achieve highly parallel data throughput.

B.

Structured Streaming is implemented as a messaging bus and is derived from Apache Kafka.

C.

Structured Streaming uses specialized hardware and I/O streams to achieve sub-second latency for data transfer.

D.

Structured Streaming models new data arriving in a data stream as new rows appended to an unbounded table.

E.

Structured Streaming relies on a distributed network of nodes that hold incremental state values for cached stages.
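A minimal sketch of the unbounded-table model, assuming a streaming source table named events (all names and paths below are assumptions): records arriving on the stream are treated as new rows appended to an input table of unbounded size, and the query processes them incrementally.

    # Hypothetical sketch: the streaming source behaves like an unbounded table;
    # each micro-batch processes only the newly appended rows.
    stream_df = spark.readStream.table("events")               # unbounded input "table"

    (stream_df.filter("event_type = 'click'")                  # incremental transformation
              .writeStream
              .outputMode("append")                            # new rows appended downstream
              .option("checkpointLocation", "/mnt/checkpoints/click_events")
              .toTable("click_events"))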

The data engineering team maintains the following code:

Assuming that this code produces logically correct results and the data in the source tables has been de-duplicated and validated, which statement describes what will occur when this code is executed?

A.

A batch job will update the enriched_itemized_orders_by_account table, replacing only those rows that have different values than the current version of the table, using accountID as the primary key.

B.

The enriched_itemized_orders_by_account table will be overwritten using the current valid version of data in each of the three tables referenced in the join logic.

C.

An incremental job will leverage information in the state store to identify unjoined rows in the source tables and write these rows to the enriched_itemized_orders_by_account table.

D.

An incremental job will detect if new rows have been written to any of the source tables; if new rows are detected, all results will be recalculated and used to overwrite the enriched_itemized_orders_by_account table.

E.

No computation will occur until enriched_itemized_orders_by_account is queried; upon query materialization, results will be calculated using the current valid version of data in each of the three tables referenced in the join logic.
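The team's code is not reproduced in this dump. Purely as a hypothetical illustration of one common shape such a job takes (the target table name comes from the options; the source tables, join keys, and write mode are assumptions, not the actual code):

    # Hypothetical sketch only -- not the maintained code. A batch job that reads
    # the current version of each source table, joins them, and rewrites the target.
    accounts = spark.read.table("accounts")          # batch read of the current table version
    orders = spark.read.table("orders")
    order_items = spark.read.table("order_items")

    enriched = (orders
                .join(order_items, "order_id")
                .join(accounts, "account_id"))

    (enriched.write
             .mode("overwrite")                       # replaces the previous table contents
             .saveAsTable("enriched_itemized_orders_by_account"))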

The following error traceback was produced by running display(df.select(3*"heartrate")) and shows AnalysisException: cannot resolve 'heartrateheartrateheartrate'. Which statement describes the error being raised?

A.

There is a type error because a DataFrame object cannot be multiplied.

B.

There is a syntax error because the heartrate column is not correctly identified as a column.

C.

There is no column in the table named heartrateheartrateheartrate.

D.

There is a type error because a column object cannot be multiplied.
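A short, hedged sketch of what produces this message (df and the heartrate column come from the question; everything else is illustrative): in Python, multiplying a string repeats it, so Spark is asked to resolve a column literally named heartrateheartrateheartrate.

    from pyspark.sql import functions as F

    repeated = 3 * "heartrate"    # Python string repetition -> "heartrateheartrateheartrate"

    # df.select(3 * "heartrate")  # asks Spark for a column with that repeated name,
                                  # which does not exist -> AnalysisException
    # df.select((3 * F.col("heartrate")).alias("heartrate_x3"))  # multiplies the column values instead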

A new data engineer notices that a critical field was omitted from an application that writes its Kafka source to Delta Lake, even though the field was present in the Kafka source. The field was also missing from data written to dependent, long-term storage. The retention threshold on the Kafka service is seven days, and the pipeline has been in production for three months.

Which describes how Delta Lake can help to avoid data loss of this nature in the future?

A.

The Delta log and Structured Streaming checkpoints record the full history of the Kafka producer.

B.

Delta Lake schema evolution can retroactively calculate the correct value for newly added fields, as long as the data was in the original source.

C.

Delta Lake automatically checks that all fields present in the source data are included in the ingestion layer.

D.

Data can never be permanently dropped or deleted from Delta Lake, so data loss is not possible under any circumstance.

E.

Ingesting all raw data and metadata from Kafka to a bronze Delta table creates a permanent, replayable history of the data state.
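A hedged sketch of the bronze-ingestion pattern described in option E (broker address, topic, and paths are assumptions): landing the full Kafka payload and metadata in a bronze Delta table preserves a replayable history, so a downstream bug can be fixed and the affected tables rebuilt without depending on Kafka's seven-day retention.

    # Hypothetical sketch: persist raw Kafka key/value plus source metadata to bronze.
    raw = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")   # assumption
                .option("subscribe", "device_events")               # assumption
                .load())

    (raw.select("key", "value", "topic", "partition", "offset", "timestamp")
        .writeStream
        .option("checkpointLocation", "/mnt/checkpoints/bronze_device_events")
        .toTable("bronze_device_events"))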

Which statement describes the default execution mode for Databricks Auto Loader?

A.

New files are identified by listing the input directory; new files are incrementally and idempotently loaded into the target Delta Lake table.

B.

Cloud vendor-specific queue storage and notification services are configured to track newly arriving files; new files are incrementally and idempotently loaded into the target Delta Lake table.

C.

Webhooks trigger a Databricks job to run any time new data arrives in a source directory; new data is automatically merged into target tables using rules inferred from the data.

D.

New files are identified by listing the input directory; the target table is materialized by directly querying all valid files in the source directory.
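As a hedged illustration of the two file-discovery modes (the options and paths below are assumptions): Auto Loader uses directory listing by default, and cloudFiles.useNotifications switches it to cloud-notification-based discovery.

    # Hypothetical sketch: default directory listing mode vs. file notification mode.
    listing_mode = (spark.readStream
                         .format("cloudFiles")
                         .option("cloudFiles.format", "json")
                         .option("cloudFiles.schemaLocation", "/mnt/schemas/events")
                         .load("/mnt/raw/events"))                    # directory listing (default)

    notification_mode = (spark.readStream
                              .format("cloudFiles")
                              .option("cloudFiles.format", "json")
                              .option("cloudFiles.useNotifications", "true")  # queue/notification services
                              .option("cloudFiles.schemaLocation", "/mnt/schemas/events")
                              .load("/mnt/raw/events"))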