An engineer notices a significant increase in the job execution time during the execution of a Spark job. After some investigation, the engineer decides to check the logs produced by the Executors.
How should the engineer retrieve the Executor logs to diagnose performance issues in the Spark application?
In the code block below,aggDFcontains aggregations on a streaming DataFrame:
Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?
A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:
The resulting Python dictionary must contain a mapping of region-> region id containing the smallest 3 region_idvalues.
Which code fragment meets the requirements?
A)
B)
C)
D)
The resulting Python dictionary must contain a mapping ofregion -> region_idfor the smallest 3region_idvalues.
Which code fragment meets the requirements?
A data engineer is working on the DataFrame:
(Referring to the table image: it has columnsId,Name,count, andtimestamp.)
Which code fragment should the engineer use to extract the unique values in theNamecolumn into an alphabetically ordered list?
A data engineer noticed improved performance after upgrading from Spark 3.0 to Spark 3.5. The engineer found that Adaptive Query Execution (AQE) was enabled.
Which operation is AQE implementing to improve performance?
A developer is working with a pandas DataFrame containing user behavior data from a web application.
Which approach should be used for executing agroupByoperation in parallel across all workers in Apache Spark 3.5?
A)
Use the applylnPandas API
B)
C)
D)
What is the difference betweendf.cache()anddf.persist()in Spark DataFrame?
A Spark application developer wants to identify which operations cause shuffling, leading to a new stage in the Spark execution plan.
Which operation results in a shuffle and a new stage?
What is the benefit of using Pandas on Spark for data transformations?
Options:
A data analyst wants to add a column date derived from a timestamp column.
Options: