
An engineer notices a significant increase in execution time while a Spark job is running. After some investigation, the engineer decides to check the logs produced by the Executors.

How should the engineer retrieve the Executor logs to diagnose performance issues in the Spark application?

A.

Locate the executor logs on the Spark master node, typically under the /tmp directory.

B.

Use the spark-submit command with the --verbose flag to print the logs to the console.

C.

Use the Spark UI to select the stage and view the executor logs directly from the stages tab.

D.

Fetch the logs by running a Spark job with the spark-sql CLI tool.

In the code block below, aggDF contains aggregations on a streaming DataFrame:

Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?

A.

complete

B.

append

C.

replace

D.

aggregate
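For reference, a minimal sketch of the pattern the question describes, assuming aggDF is an aggregated streaming DataFrame built elsewhere (complete mode rewrites the entire result table on every trigger):

# Minimal sketch; assumes `aggDF` was produced by something like
# spark.readStream(...).groupBy(...).count()
query = (aggDF.writeStream
    .outputMode("complete")  # line 3: write the entire result table each trigger
    .format("console")
    .start())
query.awaitTermination()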

A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:

The resulting Python dictionary must contain a mapping of region -> region_id for the smallest 3 region_id values.

Which code fragment meets the requirements?


A.

regions = dict(
    regions_df
    .select('region', 'region_id')
    .sort('region_id')
    .take(3)
)

B.

regions = dict(
    regions_df
    .select('region_id', 'region')
    .sort('region_id')
    .take(3)
)

C.

regions = dict(
    regions_df
    .select('region_id', 'region')
    .limit(3)
    .collect()
)

D.

regions = dict(
    regions_df
    .select('region', 'region_id')
    .sort(desc('region_id'))
    .take(3)
)
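As an illustration, a self-contained sketch of the select/sort/take pattern from option A, using hypothetical sample data; dict() accepts the list of two-field Row objects returned by take(), since each Row behaves like a (region, region_id) tuple:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Hypothetical stand-in for the small Parquet table in the question
regions_df = spark.createDataFrame(
    [('AFRICA', 0), ('AMERICA', 1), ('ASIA', 2), ('EUROPE', 3)],
    ['region', 'region_id'])

# Sorting ascending and taking 3 keeps the smallest region_id values,
# and dict() maps region -> region_id from the (region, region_id) rows
regions = dict(
    regions_df
    .select('region', 'region_id')
    .sort('region_id')
    .take(3)
)
print(regions)  # {'AFRICA': 0, 'AMERICA': 1, 'ASIA': 2}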

A data engineer is working on the DataFrame:

(Referring to the table image: it has columns Id, Name, count, and timestamp.)

Which code fragment should the engineer use to extract the unique values in the Name column into an alphabetically ordered list?

A.

df.select("Name").orderBy(df["Name"].asc())

B.

df.select("Name").distinct().orderBy(df["Name"])

C.

df.select("Name").distinct()

D.

df.select("Name").distinct().orderBy(df["Name"].desc())

A data engineer noticed improved performance after upgrading from Spark 3.0 to Spark 3.5. The engineer found that Adaptive Query Execution (AQE) was enabled.

Which operation is AQE implementing to improve performance?

A.

Dynamically switching join strategies

B.

Collecting persistent table statistics and storing them in the metastore for future use

C.

Improving the performance of single-stage Spark jobs

D.

Optimizing the layout of Delta files on disk
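AQE is controlled through runtime configuration; a minimal sketch (the adaptive features, including runtime join-strategy switching, are enabled by default in recent Spark releases):

# AQE re-optimizes the physical plan at runtime using shuffle statistics,
# e.g. converting a sort-merge join into a broadcast join
spark.conf.set("spark.sql.adaptive.enabled", "true")
print(spark.conf.get("spark.sql.adaptive.enabled"))  # 'true'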

A developer is working with a pandas DataFrame containing user behavior data from a web application.

Which approach should be used for executing a groupBy operation in parallel across all workers in Apache Spark 3.5?


A.

Use the applyInPandas API:

df.groupby("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()

B.

Use the mapInPandas API:

df.mapInPandas(mean_func, schema="user_id long, value double").show()

C.

Use a regular Spark UDF:

from pyspark.sql.functions import mean

df.groupBy("user_id").agg(mean("value")).show()

D.

Use a Pandas UDF:

@pandas_udf("double")

def mean_func(value: pd.Series) -> float:

return value.mean()

df.groupby("user_id").agg(mean_func(df["value"])).show()

What is the difference between df.cache() and df.persist() for a Spark DataFrame?

A.

Both cache() and persist() can be used to set the default storage level (MEMORY_AND_DISK_SER)

B.

Both functions perform the same operation. The persist() function provides improved performance as its default storage level is DISK_ONLY.

C.

persist() persists the DataFrame with the default storage level (MEMORY_AND_DISK_SER), and cache() can be used to set different storage levels to persist the contents of the DataFrame.

D.

cache() persists the DataFrame with the default storage level (MEMORY_AND_DISK), and persist() can be used to set different storage levels to persist the contents of the DataFrame.
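To make the distinction concrete, a small sketch: cache() takes no arguments, while persist() accepts an explicit StorageLevel:

from pyspark import StorageLevel

df.cache()                          # always MEMORY_AND_DISK for DataFrames
df.unpersist()                      # release before re-persisting
df.persist(StorageLevel.DISK_ONLY)  # persist() lets you choose the level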

A Spark application developer wants to identify which operations cause shuffling, leading to a new stage in the Spark execution plan.

Which operation results in a shuffle and a new stage?

A.

DataFrame.groupBy().agg()

B.

DataFrame.filter()

C.

DataFrame.withColumn()

D.

DataFrame.select()
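You can confirm which operation shuffles by inspecting the physical plan; a sketch assuming a simple df with a user_id column, where an Exchange node marks the shuffle boundary that starts a new stage:

from pyspark.sql import functions as F

# filter/select/withColumn are narrow: no Exchange appears in the plan
df.filter(F.col("user_id") > 0).explain()

# groupBy().agg() is wide: the plan shows an Exchange (shuffle) node
df.groupBy("user_id").agg(F.count("*").alias("n")).explain()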

What is the benefit of using Pandas on Spark for data transformations?

Options:

A.

It is available only with Python, thereby reducing the learning curve.

B.

It computes results immediately using eager execution, making it simple to use.

C.

It runs on a single node only, utilizing the memory with memory-bound DataFrames and hence cost-efficient.

D.

It executes queries faster using all the available cores in the cluster as well as provides Pandas’s rich set of features.
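A small sketch of the pandas API on Spark (pyspark.pandas), which keeps the familiar pandas surface while distributing execution across the cluster:

import pyspark.pandas as ps

# pandas-style syntax, but the work is planned and executed on Spark
psdf = ps.DataFrame({"user_id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})
psdf["doubled"] = psdf["value"] * 2
print(psdf.head())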

A data analyst wants to add a column date derived from a timestamp column. Which code fragment should the analyst use?

Options:

A.

dates_df.withColumn("date", f.unix_timestamp("timestamp")).show()

B.

dates_df.withColumn("date", f.to_date("timestamp")).show()

C.

dates_df.withColumn("date", f.date_format("timestamp", "yyyy-MM-dd")).show()

D.

dates_df.withColumn("date", f.from_unixtime("timestamp")).show()