A data engineer is reviewing a Spark application that applies several transformations to a DataFrame but notices that the job does not start executing immediately.

Which two characteristics of Apache Spark's execution model explain this behavior?

Choose 2 answers:

A.

The Spark engine requires manual intervention to start executing transformations.

B.

Only actions trigger the execution of the transformation pipeline.

C.

Transformations are executed immediately to build the lineage graph.

D.

The Spark engine optimizes the execution plan during the transformations, causing delays.

E.

Transformations are evaluated lazily.
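
A minimal sketch of the behavior described above, assuming a local SparkSession (the column names are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)                          # defining the DataFrame runs no job
doubled = df.withColumn("x", col("id") * 2)   # transformation: lazily recorded in the lineage
filtered = doubled.filter(col("x") > 4)       # transformation: still nothing executes
print(filtered.count())                       # action: triggers execution of the whole pipeline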

An engineer has a large ORC file located at /file/test_data.orc and wants to read only specific columns to reduce memory usage.

Which code fragment will select the columns, i.e., col1 and col2, during the reading process?

A.

spark.read.orc("/file/test_data.orc").filter("col1 = 'value'").select("col2")

B.

spark.read.format("orc").select("col1", "col2").load("/file/test_data.orc")

C.

spark.read.orc("/file/test_data.orc").selected("col1", "col2")

D.

spark.read.format("orc").load("/file/test_data.orc").select("col1", "col2")

Given a CSV file with the content:

bambi
alladin,20

And the following code:

from pyspark.sql.types import *

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])

spark.read.schema(schema).csv(path).collect()

What is the resulting output?

A.

[Row(name='bambi'), Row(name='alladin', age=20)]

B.

[Row(name='alladin', age=20)]

C.

[Row(name='bambi', age=None), Row(name='alladin', age=20)]

D.

The code throws an error due to a schema mismatch.
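
A runnable version, assuming the two-line file shown above is written locally first:

import os, tempfile
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()
path = os.path.join(tempfile.mkdtemp(), "people.csv")
with open(path, "w") as f:
    f.write("bambi\nalladin,20\n")

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])
print(spark.read.schema(schema).csv(path).collect())
# The default PERMISSIVE mode fills the missing field with null:
# [Row(name='bambi', age=None), Row(name='alladin', age=20)]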

A Spark engineer is troubleshooting a Spark application that has been encountering out-of-memory errors during execution. By reviewing the Spark driver logs, the engineer notices multiple "GC overhead limit exceeded" messages.

Which action should the engineer take to resolve this issue?

A.

Optimize the data processing logic by repartitioning the DataFrame.

B.

Modify the Spark configuration to disable garbage collection.

C.

Increase the memory allocated to the Spark Driver.

D.

Cache large DataFrames to persist them in memory.
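
Driver memory is fixed when the driver JVM launches, so it is normally raised at submit time; a hedged sketch (the 4g value is illustrative):

# At submit time (preferred):
#   spark-submit --driver-memory 4g app.py
# Or via builder config, which only takes effect if set before the driver JVM starts:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .config("spark.driver.memory", "4g")
         .getOrCreate())
print(spark.conf.get("spark.driver.memory"))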

A data engineer is working with a large JSON dataset containing order information. The dataset is stored in a distributed file system and needs to be loaded into a Spark DataFrame for analysis. The data engineer wants to ensure that the schema is correctly defined and that the data is read efficiently.

Which approach should the data engineer use to efficiently load the JSON data into a Spark DataFrame with a predefined schema?

A.

Use spark.read.json() to load the data, then use DataFrame.printSchema() to view the inferred schema, and finally use DataFrame.cast() to modify column types.

B.

Use spark.read.json() with the inferSchema option set to true.

C.

Use spark.read.format("json").load() and then use DataFrame.withColumn() to cast each column to the desired data type.

D.

Define a StructType schema and use spark.read.schema(predefinedSchema).json() to load the data.
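
A minimal sketch of the predefined-schema read; the field names and path are assumptions for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

spark = SparkSession.builder.getOrCreate()
predefinedSchema = StructType([
    StructField("order_id", StringType()),
    StructField("quantity", IntegerType()),
    StructField("price", DoubleType())
])
# Supplying the schema up front skips the extra pass over the data that inference needs:
orders = spark.read.schema(predefinedSchema).json("/data/orders")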

Given a DataFrame df that has 10 partitions, after running the code:

result = df.coalesce(20)

How many partitions will the result DataFrame have?

A.

10

B.

Same number as the cluster executors

C.

1

D.

20
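
A quick check of the semantics: coalesce() can only reduce the partition count, so asking for more is a no-op, while repartition() can increase it at the cost of a shuffle:

df = spark.range(100).repartition(10)
print(df.rdd.getNumPartitions())                   # 10
print(df.coalesce(20).rdd.getNumPartitions())      # still 10: coalesce cannot add partitions
print(df.repartition(20).rdd.getNumPartitions())   # 20: repartition shuffles to grow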

A Spark application is experiencing performance issues in client mode because the driver is resource-constrained.

How should this issue be resolved?

A.

Add more executor instances to the cluster

B.

Increase the driver memory on the client machine

C.

Switch the deployment mode to cluster mode

D.

Switch the deployment mode to local mode
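
Deploy mode is chosen at submission time, not in application code; a hedged sketch (the YARN master and file name are illustrative):

# Submitting in cluster mode moves the driver off the resource-constrained client machine:
#   spark-submit --master yarn --deploy-mode cluster app.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
print(spark.conf.get("spark.submit.deployMode", "unset"))   # reports client or cluster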

Which feature of Spark Connect should be considered when designing an application that needs remote interaction with a Spark cluster?

A.

It provides a way to run Spark applications remotely in any programming language

B.

It can be used to interact with any remote cluster using the REST API

C.

It allows for remote execution of Spark jobs

D.

It is primarily used for data ingestion into Spark from external sources
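
A minimal connection sketch (requires PySpark 3.4+ with the connect extras; the host and port are illustrative, 15002 being the default):

from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
spark.range(5).show()   # the unresolved plan is sent over gRPC and executed remotely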

A data engineer is working on a streaming DataFrame streaming_df with the given streaming data:

Which operation is supported with streaming_df?

A.

streaming_df.select(countDistinct("Name"))

B.

streaming_df.groupby("Id").count()

C.

streaming_df.orderBy("timestamp").limit(4)

D.

streaming_df.filter(col("count") < 30).show()
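
A sketch of the supported pattern, with a rate source standing in for the question's data (timestamp and value are the rate source's columns). Aggregations like groupBy().count() are supported on streams, whereas countDistinct, global orderBy/limit, and calling show() on a streaming DataFrame are not:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
stream = spark.readStream.format("rate").load()    # streaming source for illustration

counts = stream.groupBy("value").count()           # streaming aggregation: supported
query = (counts.writeStream
         .outputMode("complete")                   # aggregations need complete/update mode
         .format("console")
         .start())
query.awaitTermination(10)                         # let the sketch run briefly
query.stop()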

A developer wants to test Spark Connect with an existing Spark application.

What are the two alternative ways the developer can start a local Spark Connect server without changing their existing application code? (Choose 2 answers)

A.

Execute their pyspark shell with the option--remote "https://localhost"

B.

Execute their pyspark shell with the option--remote "sc://localhost"

C.

Set the environment variable SPARK_REMOTE="sc://localhost" before starting the pyspark shell

D.

Add .remote("sc://localhost") to their SparkSession.builder calls in their Spark code

E.

Ensure the Spark property spark.connect.grpc.binding.port is set to 15002 in the application code
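
The environment-variable and shell-option approaches side by side (shell commands shown as comments; sc://localhost implies the default port 15002):

# Export the environment variable, then start the shell unchanged:
#   export SPARK_REMOTE="sc://localhost"
#   pyspark
# Or pass the URL as a shell option instead:
#   pyspark --remote "sc://localhost"
# Existing application code then runs against Spark Connect untouched:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()   # picks up SPARK_REMOTE when it is set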