
A data scientist at an e-commerce company is working with user data from its subscriber database, stored in a DataFrame df_user. Before processing the data further, the data scientist wants to create another DataFrame, df_user_non_pii, containing only the non-PII columns. The PII columns in df_user are first_name, last_name, email, and birthdate.

Which code snippet can be used to meet this requirement?

A.

df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")

B.

df_user_non_pii = df_user.drop("first_name, last_name, email, birthdate")

C.

df_user_non_pii = df_user.dropfields("first_name", "last_name", "email", "birthdate")

D.

df_user_non_pii = df_user.dropfields("first_name, last_name, email, birthdate")
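
For reference, a self-contained sketch of the drop-based approach (the sample row and the extra user_id/tier columns are illustrative, not from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative data; the PII column names match the question's schema.
df_user = spark.createDataFrame(
    [(1, "Ada", "Lovelace", "ada@example.com", "1815-12-10", "gold")],
    ["user_id", "first_name", "last_name", "email", "birthdate", "tier"],
)

# DataFrame.drop accepts one or more column names as separate arguments
# and silently ignores names that do not exist.
df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")
df_user_non_pii.show()  # only user_id and tier remain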

A data engineer is reviewing a Spark application that applies several transformations to a DataFrame but notices that the job does not start executing immediately.

Which two characteristics of Apache Spark's execution model explain this behavior?

Choose 2 answers:

A.

The Spark engine requires manual intervention to start executing transformations.

B.

Only actions trigger the execution of the transformation pipeline.

C.

Transformations are executed immediately to build the lineage graph.

D.

The Spark engine optimizes the execution plan during the transformations, causing delays.

E.

Transformations are evaluated lazily.
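
A small sketch of the lazy-evaluation behavior described above:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

# Transformations only extend the logical plan; no job starts here.
transformed = df.filter(F.col("id") % 2 == 0).withColumn("doubled", F.col("id") * 2)

# Only an action (count, collect, show, write, ...) triggers execution
# of the whole optimized pipeline.
print(transformed.count())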

A data engineer uses a broadcast variable to share a DataFrame containing millions of rows across executors for lookup purposes. What will be the outcome?

A.

The job may fail if the memory on each executor is not large enough to accommodate the DataFrame being broadcasted

B.

The job may fail if the executors do not have enough CPU cores to process the broadcasted dataset

C.

The job will hang indefinitely as Spark will struggle to distribute and serialize such a large broadcast variable to all executors

D.

The job may fail because the driver does not have enough CPU cores to serialize the large DataFrame
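
For illustration, a minimal sketch of the broadcast-variable lookup pattern (lookup_df, sales_df, and the column names are hypothetical). A broadcast variable is fully replicated into each executor's memory, which is why insufficient executor memory is the realistic failure mode for a multi-million-row lookup:

# The lookup data must first be collected to the driver, then a full
# copy is shipped to every executor.
lookup_rows = lookup_df.collect()
lookup_map = {row["product_id"]: row["product_name"] for row in lookup_rows}
bc_lookup = spark.sparkContext.broadcast(lookup_map)

# Executors read bc_lookup.value locally, with no shuffle.
result = sales_df.rdd.map(lambda r: (r["product_id"], bc_lookup.value.get(r["product_id"])))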


A data engineer is working on a real-time analytics pipeline using Spark Structured Streaming.

They want the system to process incoming data in micro-batches at a fixed interval of 5 seconds.

Which code snippet fulfills this requirement?

A.

query = df.writeStream \
    .outputMode("append") \
    .trigger(processingTime="5 seconds") \
    .start()

B.

query = df.writeStream \
    .outputMode("append") \
    .trigger(continuous="5 seconds") \
    .start()

C.

query = df.writeStream \
    .outputMode("append") \
    .trigger(once=True) \
    .start()

D.

query = df.writeStream \
    .outputMode("append") \
    .start()
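
For context, a minimal end-to-end sketch of a fixed 5-second micro-batch trigger, using the built-in rate source as a stand-in for real streaming input:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The "rate" source generates synthetic rows for testing.
df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (
    df.writeStream
    .format("console")
    .outputMode("append")
    .trigger(processingTime="5 seconds")  # micro-batch every 5 seconds
    .start()
)
query.awaitTermination()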


A data engineer is working with the DataFrame df1 and wants the Name with the highest count to appear first (descending order by count), followed by the next highest, and so on.

The DataFrame has columns:

id | Name    | count | timestamp
---|---------|-------|----------
1  | USA     | 10    |
2  | India   | 20    |
3  | England | 50    |
4  | India   | 50    |
5  | France  | 20    |
6  | India   | 10    |
7  | USA     | 30    |
8  | USA     | 40    |

Which code fragment should the engineer use to sort the data in the Name and count columns?

A.

df1.orderBy(col("count").desc(), col("Name").asc())

B.

df1.sort("Name", "count")

C.

df1.orderBy("Name", "count")

D.

df1.orderBy(col("Name").desc(), col("count").asc())

In the code block below, aggDF contains aggregations on a streaming DataFrame:
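
The snippet is reconstructed here as a minimal sketch (the console sink and exact line layout are assumptions inferred from the question):

query = (aggDF.writeStream     # line 1
    .format("console")         # line 2
    .outputMode("...")         # line 3: the output mode in question
    .start())                  # line 4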

Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?

A.

complete

B.

append

C.

replace

D.

aggregate

A data engineer writes the following code to join two DataFrames df1 and df2:

df1 = spark.read.csv("sales_data.csv")    # ~10 GB
df2 = spark.read.csv("product_data.csv")  # ~8 MB
result = df1.join(df2, df1.product_id == df2.product_id)

Which join strategy will Spark use?

A.

Shuffle join, because AQE is not enabled, and Spark uses a static query plan

B.

Broadcast join, as df2 is smaller than the default broadcast threshold

C.

Shuffle join, as the size difference between df1 and df2 is too large for a broadcast join to work efficiently

D.

Shuffle join because no broadcast hints were provided
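
A quick way to inspect the chosen strategy: the default spark.sql.autoBroadcastJoinThreshold is 10 MB (10485760 bytes), so an ~8 MB table qualifies for an automatic broadcast even without a hint. Assuming the question's DataFrames with a resolvable product_id column:

# 10485760 by default
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

result = df1.join(df2, df1.product_id == df2.product_id)
result.explain()  # look for BroadcastHashJoin in the physical plan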


A developer wants to refactor older Spark code to take advantage of built-in functions introduced in Spark 3.5.

The original code:

from pyspark.sql import functions as F

min_price = 110.50
result_df = prices_df.filter(F.col("price") > min_price).agg(F.count("*"))

Which code block should the developer use to refactor the code?

A.

result_df = prices_df.filter(F.col("price") > F.lit(min_price)).agg(F.count("*"))

B.

result_df = prices_df.where(F.lit("price") > min_price).groupBy().count()

C.

result_df = prices_df.withColumn("valid_price", when(col("price") > F.lit(min_price), True))

D.

result_df = prices_df.filter(F.lit(min_price) > F.col("price")).count()
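
For reference, a self-contained sketch of the F.lit-based comparison (the sample prices are illustrative):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
prices_df = spark.createDataFrame([(100.0,), (120.0,), (150.0,)], ["price"])

min_price = 110.50
# F.lit wraps the Python float in a Column literal, so the whole
# predicate stays within Spark's built-in expression functions.
result_df = prices_df.filter(F.col("price") > F.lit(min_price)).agg(F.count("*"))
result_df.show()  # count = 2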


Which feature of Spark Connect should be considered when designing an application that requires remote interaction with a Spark cluster?

A.

It is primarily used for data ingestion into Spark from external sources.

B.

It provides a way to run Spark applications remotely in any programming language.

C.

It can be used to interact with any remote cluster using the REST API.

D.

It allows for remote execution of Spark jobs.
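
For context, a minimal Spark Connect client sketch (the endpoint host and port are placeholders); DataFrame operations issued by the client execute remotely on the cluster:

from pyspark.sql import SparkSession

# "sc://" endpoints use the Spark Connect protocol (PySpark 3.4+).
spark = SparkSession.builder.remote("sc://spark-server:15002").getOrCreate()

spark.range(5).show()  # the job runs on the remote cluster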

A Spark DataFrame df is cached using the MEMORY_AND_DISK storage level, but the DataFrame is too large to fit entirely in memory.

What is the likely behavior when Spark runs out of memory to store the DataFrame?

A.

Spark duplicates the DataFrame in both memory and disk. If it doesn't fit in memory, the DataFrame is stored and retrieved from the disk entirely.

B.

Spark splits the DataFrame evenly between memory and disk, ensuring balanced storage utilization.

C.

Spark will store as much data as possible in memory and spill the rest to disk when memory is full, continuing processing with performance overhead.

D.

Spark stores the frequently accessed rows in memory and less frequently accessed rows on disk, utilizing both resources to offer balanced performance.
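
A minimal sketch of the MEMORY_AND_DISK behavior described above (the dataset size is illustrative):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100_000_000)

# Partitions that fit stay in memory; the rest spill to disk and are
# re-read from disk on access, so processing continues with extra I/O cost.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  # materializes the cache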