
Spill occurs as a result of executing various wide transformations. However, diagnosing spill requires one to proactively look for key indicators.

Where in the Spark UI are two of the primary indicators that a partition is spilling to disk?

A.

Stage’s detail screen and Executor’s files

B.

Stage’s detail screen and Query’s detail screen

C.

Driver’s and Executor’s log files

D.

Executor’s detail screen and Executor’s log files
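For context on the indicators themselves: when a shuffle partition overflows executor memory, the amount spilled shows up as "Spill (Memory)" and "Spill (Disk)" metrics in the UI. A minimal sketch of a wide transformation that can trigger spill, assuming a hypothetical table and column names:

from pyspark.sql import functions as F

# A large table grouped on a high-cardinality key forces a shuffle; if a
# shuffle partition no longer fits in executor memory, Spark spills it to disk.
events = spark.read.table("prod.web.click_events")        # hypothetical table
by_user = (
    events
    .groupBy("user_id")
    .agg(
        F.count("*").alias("clicks"),
        F.approx_count_distinct("session_id").alias("sessions"),
    )
)
by_user.write.mode("overwrite").saveAsTable("prod.web.clicks_by_user")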

A production workload incrementally applies updates from an external Change Data Capture feed to a Delta Lake table as an always-on Structured Stream job. When data was initially migrated for this table, OPTIMIZE was executed and most data files were resized to 1 GB. Auto Optimize and Auto Compaction were both turned on for the streaming production job. Recent review of data files shows that most data files are under 64 MB, although each partition in the table contains at least 1 GB of data and the total table size is over 10 TB.

Which of the following likely explains these smaller file sizes?

A.

Databricks has autotuned to a smaller target file size to reduce duration of MERGE operations

B.

Z-order indices calculated on the table are preventing file compaction

C.

Bloom filter indices calculated on the table are preventing file compaction

D.

Databricks has autotuned to a smaller target file size based on the overall size of data in the table

E.

Databricks has autotuned to a smaller target file size based on the amount of data in each partition
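For reference, the autotuning behaviour the answer choices describe is controlled through Delta table properties. A minimal sketch of setting and inspecting them, assuming a hypothetical table name:

# Hypothetical table name; the properties shown are the knobs that influence target file size.
spark.sql("""
    ALTER TABLE prod.cdc.customers SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true',
        'delta.tuneFileSizesForRewrites' = 'true'
    )
""")
# DESCRIBE DETAIL reports numFiles and sizeInBytes, useful for reviewing current file sizes.
spark.sql("DESCRIBE DETAIL prod.cdc.customers").show(truncate=False)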

A table is registered with the following code:
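(The original snippet did not survive extraction. A minimal sketch of the kind of registration the question describes; the selected columns, the date filter, and the user_id join key are assumptions.)

spark.sql("""
    CREATE OR REPLACE VIEW recent_orders AS
    SELECT o.order_id, o.order_date, u.email              -- illustrative columns
    FROM orders AS o
    JOIN users AS u ON o.user_id = u.user_id              -- assumed join key
    WHERE o.order_date >= date_sub(current_date(), 7)
""")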

Both users and orders are Delta Lake tables. Which statement describes the results of querying recent_orders?

A.

All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.

B.

All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.

C.

Results will be computed and cached when the table is defined; these cached results will incrementally update as new records are inserted into source tables.

D.

All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.

E.

The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.

The business reporting team requires that data for their dashboards be updated every hour. The pipeline that extracts, transforms, and loads the data for their dashboards completes in 10 minutes.

Assuming normal operating conditions, which configuration will meet their service-level agreement requirements with the lowest cost?

A.

Schedule a job to execute the pipeline once an hour on a dedicated interactive cluster.

B.

Schedule a Structured Streaming job with a trigger interval of 60 minutes.

C.

Schedule a job to execute the pipeline once an hour on a new job cluster.

D.

Configure a job that executes every time new data lands in a given directory.
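For reference, an hourly schedule on an ephemeral job cluster can be expressed as a Jobs API 2.1 settings payload. A minimal sketch, with the notebook path, node type, and runtime version as illustrative values:

# Hypothetical Jobs 2.1 settings for an hourly pipeline run on a per-run job cluster.
job_settings = {
    "name": "hourly-reporting-etl",
    "schedule": {
        "quartz_cron_expression": "0 0 * * * ?",   # top of every hour
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
    "tasks": [
        {
            "task_key": "run_pipeline",
            "notebook_task": {"notebook_path": "/Repos/reporting/hourly_pipeline"},
            "new_cluster": {                        # created per run, terminated afterwards
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
}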

A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.

The user_ltv table has the following schema:

email STRING, age INT, ltv INT

The following view definition is executed:
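(The view definition is likewise missing from this dump. A minimal sketch of the kind of column-masking view the question describes; the marketing group name comes from the question, while the exact masking expression is an assumption.)

spark.sql("""
    CREATE OR REPLACE VIEW email_ltv AS
    SELECT
        CASE WHEN is_member('marketing') THEN email ELSE 'REDACTED' END AS email,
        ltv
    FROM user_ltv
""")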

An analyst who is not a member of the marketing group executes the following query:

SELECT * FROM email_ltv

Which statement describes the results returned by this query?

A.

Three columns will be returned, but one column will be named "REDACTED" and contain only null values.

B.

Only the email and ltv columns will be returned; the email column will contain all null values.

C.

The email and ltv columns will be returned with the values in user_ltv.

D.

The email, age, and ltv columns will be returned with the values in user_ltv.

E.

Only the email and ltv columns will be returned; the email column will contain the string "REDACTED" in each row.

A data engineer is building a simple data pipeline using Lakeflow Declarative Pipelines (LDP) in Databricks to ingest customer data. The raw customer data is stored in a cloud storage location in JSON format. The task is to create a Lakeflow Declarative Pipeline that reads the raw JSON data and writes it into a Delta table for further processing.

Which code snippet will correctly ingest the raw JSON data and create a Delta table using LDP?

A.

import dlt

@dlt.table
def raw_customers():
    return spark.read.format("csv").load("s3://my-bucket/raw-customers/")

B.

import dlt

@dlt.table
def raw_customers():
    return spark.read.json("s3://my-bucket/raw-customers/")

C.

import dlt

@dlt.table
def raw_customers():
    return spark.read.format("parquet").load("s3://my-bucket/raw-customers/")

D.

import dlt

@dlt.view
def raw_customers():
    return spark.format.json("s3://my-bucket/raw-customers/")

A data engineer is using the Lakeflow Declarative Pipelines Expectations feature to track the data quality of their incoming sensor data. Periodically, sensors send bad readings that are out of range; these rows are currently flagged with a warning and written to the silver table along with the good data. They have been given a new requirement: the bad rows need to be quarantined in a separate quarantine table and no longer included in the silver table.

This is the existing code for their silver table:

@dlt.table
@dlt.expect("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

What code will satisfy the requirements?

A.

@dlt.table
@dlt.expect("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect("invalid_sensor_reading", "reading >= 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

B.

@dlt.table
@dlt.expect_or_drop("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect("invalid_sensor_reading", "reading < 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

C.

@dlt.table
@dlt.expect_or_drop("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect_or_drop("invalid_sensor_reading", "reading >= 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

D.

@dlt.table
@dlt.expect_or_drop("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect("invalid_sensor_reading", "reading >= 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

A data engineer is creating a data ingestion pipeline to understand where customers are taking their rented bicycles during use. The engineer noticed that, over time, data being transmitted from the bicycle sensors fails to include key details like latitude and longitude. Downstream analysts need both the clean records and the quarantined records available for separate processing.

The data engineer already has this code:

import dlt
from pyspark.sql.functions import expr

rules = {
    "valid_lat": "(lat IS NOT NULL)",
    "valid_long": "(long IS NOT NULL)"
}

quarantine_rules = "NOT({})".format(" AND ".join(rules.values()))

@dlt.view
def raw_trips_data():
    return spark.readStream.table("ride_and_go.telemetry.trips")

How should the data engineer meet the requirements to capture good and bad data?

A.

@dlt.table(name="trips_data_quarantine")
def trips_data_quarantine():
    return (
        spark.readStream.table("raw_trips_data")
        .filter(expr(quarantine_rules))
    )

B.

@dlt.view
@dlt.expect_or_drop("lat_long_present", "(lat IS NOT NULL AND long IS NOT NULL)")
def trips_data_quarantine():
    return spark.readStream.table("ride_and_go.telemetry.trips")

C.

@dlt.table
@dlt.expect_all_or_drop(rules)
def trips_data_quarantine():
    return spark.readStream.table("raw_trips_data")

D.

@dlt.table(partition_cols=["is_quarantined"])
@dlt.expect_all(rules)
def trips_data_quarantine():
    return (
        spark.readStream.table("raw_trips_data")
        .withColumn("is_quarantined", expr(quarantine_rules))
    )
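For contrast with the answer choices above, a companion table that keeps only the clean rows could reuse the same rules dictionary. A minimal sketch, with the table and function names as illustrative choices:

@dlt.table(name="trips_data_clean")       # illustrative name
@dlt.expect_all_or_drop(rules)            # drop any row that violates a rule
def trips_data_clean():
    return spark.readStream.table("raw_trips_data")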

Which statement describes the correct use of pyspark.sql.functions.broadcast?

A.

It marks a column as having low enough cardinality to properly map distinct values to available partitions, allowing a broadcast join.

B.

It marks a column as small enough to store in memory on all executors, allowing a broadcast join.

C.

It caches a copy of the indicated table on attached storage volumes for all active clusters within a Databricks workspace.

D.

It marks a DataFrame as small enough to store in memory on all executors, allowing a broadcast join.

E.

It caches a copy of the indicated table on all nodes in the cluster for use in all future queries during the cluster lifetime.
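A minimal sketch of the function in use, assuming illustrative table names and join key:

from pyspark.sql.functions import broadcast

orders = spark.read.table("orders")       # large fact table
regions = spark.read.table("regions")     # small dimension table
# broadcast() hints that regions is small enough to ship to every executor,
# so the join avoids shuffling the large side.
joined = orders.join(broadcast(regions), on="region_id", how="left")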

A Delta Lake table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.

Immediately after each update succeeds, the data engineering team would like to determine the difference between the new version and the previous version of the table.

Given the current implementation, which method can be used?

A.

Parse the Delta Lake transaction log to identify all newly written data files.

B.

Execute DESCRIBE HISTORY customer_churn_params to obtain the full operation metrics for the update, including a log of all records that have been added or modified.

C.

Execute a query to calculate the difference between the new version and the previous version using Delta Lake’s built-in versioning and time travel functionality.

D.

Parse the Spark event logs to identify those rows that were updated, inserted, or deleted.
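A minimal sketch of the time-travel comparison approach, using the table name from the answer choices; the version arithmetic assumes the nightly overwrite produced exactly one new table version:

# Find the latest version of the table, then diff it against the previous one.
latest = (
    spark.sql("DESCRIBE HISTORY customer_churn_params")
    .selectExpr("max(version) AS v")
    .first()["v"]
)

current  = spark.sql(f"SELECT * FROM customer_churn_params VERSION AS OF {latest}")
previous = spark.sql(f"SELECT * FROM customer_churn_params VERSION AS OF {latest - 1}")

added_or_changed = current.exceptAll(previous)   # rows new in the latest version
removed          = previous.exceptAll(current)   # rows no longer present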