
A DataFrame df has columns name, age, and salary. The developer needs to sort the DataFrame by age in ascending order and salary in descending order.

Which code snippet meets the requirement of the developer?

A.

df.orderBy(col("age").asc(), col("salary").asc()).show()

B.

df.sort("age", "salary", ascending=[True, True]).show()

C.

df.sort("age", "salary", ascending=[False, True]).show()

D.

df.orderBy("age", "salary", ascending=[True, False]).show()
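The ordering this question asks for (age ascending, ties broken by salary descending) can be modelled in plain Python. This is only a sketch of the expected row order, not Spark code, and the sample rows are invented for illustration.

```python
# Plain-Python model of "age ascending, then salary descending".
# The rows below are made-up sample data, not taken from the question.
rows = [
    {"name": "Ann", "age": 30, "salary": 5000},
    {"name": "Bob", "age": 25, "salary": 4000},
    {"name": "Cid", "age": 30, "salary": 7000},
    {"name": "Dee", "age": 25, "salary": 6000},
]

# Negating the numeric salary reverses its direction inside a single sort key,
# which mirrors passing ascending=[True, False] for ("age", "salary").
ordered = sorted(rows, key=lambda r: (r["age"], -r["salary"]))

names = [r["name"] for r in ordered]
print(names)
```

Within each age group the higher salary comes first, which is the behaviour a per-column ascending list of [True, False] describes.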

The following code fragment results in an error:

Which code fragment should be used instead?

A)

B)

C)

D)

14 of 55.

A developer created a DataFrame with columns color, fruit, and taste, and wrote the data to a Parquet directory using:

df.write.partitionBy("color", "taste").parquet("/path/to/output")

What is the result of this code?

A.

It appends new partitions to an existing Parquet file.

B.

It throws an error if there are null values in either partition column.

C.

It creates separate directories for each unique combination of color and taste.

D.

It stores all data in a single Parquet file.
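A rough way to picture what partitionBy("color", "taste") produces is to compute the Hive-style directory names (column=value, nested in the order the columns are given). The helper below is a hypothetical plain-Python model, not a Spark API, and the sample rows are invented.

```python
from pathlib import PurePosixPath


def partition_dirs(rows, base, partition_cols):
    """Return the sorted, de-duplicated partition directories for the rows."""
    dirs = set()
    for row in rows:
        path = PurePosixPath(base)
        for col in partition_cols:
            # One nested directory level per partition column: col=value
            path = path / f"{col}={row[col]}"
        dirs.add(str(path))
    return sorted(dirs)


# Made-up sample data with the three columns named in the question.
rows = [
    {"color": "red", "fruit": "apple", "taste": "sweet"},
    {"color": "red", "fruit": "cherry", "taste": "sweet"},
    {"color": "red", "fruit": "currant", "taste": "sour"},
    {"color": "yellow", "fruit": "lemon", "taste": "sour"},
]

dirs = partition_dirs(rows, "/path/to/output", ["color", "taste"])
for d in dirs:
    print(d)
```

Four rows yield three directories here because two rows share the same (color, taste) combination; the data files themselves would sit inside those directories.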

18 of 55.

An engineer has two DataFrames — df1 (small) and df2 (large). To optimize the join, the engineer uses a broadcast join:

from pyspark.sql.functions import broadcast

df_result = df2.join(broadcast(df1), on="id", how="inner")

What is the purpose of using broadcast() in this scenario?

A.

It increases the partition size for df1 and df2.

B.

It ensures that the join happens only when the id values are identical.

C.

It reduces the number of shuffle operations by replicating the smaller DataFrame to all nodes.

D.

It filters the id values before performing the join.
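The idea behind a broadcast join can be sketched without Spark: the small side becomes a lookup table that is copied to wherever each partition of the large side lives, so the large rows never have to be moved. The function and sample data below are hypothetical and only model the concept.

```python
def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join each partition of the large side against a shared lookup."""
    # Build the lookup once; in Spark a copy would be shipped to every executor.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined_partitions = []
    for partition in large_partitions:
        # Each partition joins locally, so rows of the large side stay in place.
        joined = [
            {**big, **small}
            for big in partition
            for small in lookup.get(big[key], [])
        ]
        joined_partitions.append(joined)
    return joined_partitions


# Made-up data: df1 is the small side, df2 is the large side in two partitions.
df1_rows = [{"id": 1, "label": "a"}, {"id": 2, "label": "b"}]
df2_partitions = [
    [{"id": 1, "v": 10}, {"id": 3, "v": 30}],
    [{"id": 2, "v": 20}, {"id": 1, "v": 11}],
]

result = broadcast_hash_join(df2_partitions, df1_rows, "id")
print(result)
```

The output keeps the partitioning of the large side unchanged, which is the point: no repartitioning of df2 by id is needed to find the matches.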

Given a CSV file with the content:

And the following code:

from pyspark.sql.types import *

schema = StructType([

StructField("name", StringType()),

StructField("age", IntegerType())

])

spark.read.schema(schema).csv(path).collect()

What is the resulting output?

A.

[Row(name='bambi'), Row(name='alladin', age=20)]

B.

[Row(name='alladin', age=20)]

C.

[Row(name='bambi', age=None), Row(name='alladin', age=20)]

D.

The code throws an error due to a schema mismatch.

What is the relationship between jobs, stages, and tasks during execution in Apache Spark?

Options:

A.

A job contains multiple stages, and each stage contains multiple tasks.

B.

A job contains multiple tasks, and each task contains multiple stages.

C.

A stage contains multiple jobs, and each job contains multiple tasks.

D.

A stage contains multiple tasks, and each task contains multiple jobs.
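The containment hierarchy can be sketched as nested data structures: an action triggers a job, the job is split into stages at shuffle boundaries, and each stage runs one task per partition. The classes, stage names, and partition counts below are illustrative assumptions, not Spark internals.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    partition: int  # a task processes one partition


@dataclass
class Stage:
    name: str
    tasks: list = field(default_factory=list)


@dataclass
class Job:
    action: str
    stages: list = field(default_factory=list)


def build_job(action, stage_partitions):
    """stage_partitions maps a stage name to its (assumed) partition count."""
    stages = [
        Stage(name, [Task(p) for p in range(count)])
        for name, count in stage_partitions.items()
    ]
    return Job(action, stages)


# Hypothetical job: one shuffle boundary, hence two stages.
job = build_job("count", {"before shuffle": 4, "after shuffle": 2})
total_tasks = sum(len(stage.tasks) for stage in job.stages)
print(len(job.stages), total_tasks)
```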

Given the schema:

event_ts TIMESTAMP,

sensor_id STRING,

metric_value LONG,

ingest_ts TIMESTAMP,

source_file_path STRING

The goal is to deduplicate based on: event_ts, sensor_id, and metric_value.

Options:

A.

dropDuplicates on all columns (wrong criteria)

B.

dropDuplicates with no arguments (removes based on all columns)

C.

groupBy without aggregation (invalid use)

D.

dropDuplicates on the exact matching fields
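The difference between deduplicating on all columns and on a chosen subset can be modelled in plain Python. This sketch keeps the first row seen for each key, which is an assumption of the model (Spark makes no promise about which duplicate survives), and the sample values are invented.

```python
def drop_duplicates(rows, subset):
    """Keep the first row for each distinct combination of the subset columns."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[col] for col in subset)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Made-up rows: the first two are the same event ingested from different files.
rows = [
    {"event_ts": "2024-01-01 00:00:00", "sensor_id": "s1", "metric_value": 10,
     "ingest_ts": "2024-01-01 00:05:00", "source_file_path": "/landing/a.json"},
    {"event_ts": "2024-01-01 00:00:00", "sensor_id": "s1", "metric_value": 10,
     "ingest_ts": "2024-01-01 00:09:00", "source_file_path": "/landing/b.json"},
    {"event_ts": "2024-01-01 00:00:00", "sensor_id": "s2", "metric_value": 10,
     "ingest_ts": "2024-01-01 00:05:00", "source_file_path": "/landing/a.json"},
]

all_columns = list(rows[0].keys())
by_all = drop_duplicates(rows, all_columns)
by_key = drop_duplicates(rows, ["event_ts", "sensor_id", "metric_value"])
print(len(by_all), len(by_key))
```

Deduplicating on every column keeps all three rows because ingest_ts and source_file_path differ; restricting the key to event_ts, sensor_id, and metric_value collapses the repeated event.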

1 of 55. A data scientist wants to ingest a directory full of plain text files so that each record in the output DataFrame contains the entire contents of a single file and the full path of the file the text was read from.

The first attempt does read the text files, but each record contains a single line. This code is shown below:

from pyspark.sql.functions import input_file_name

txt_path = "/datasets/raw_txt/*"

df = spark.read.text(txt_path) # one row per line by default

df = df.withColumn("file_path", input_file_name()) # add full path

Which code change meets the data scientist's requirements?

A.

Add the option wholetext to the text() function.

B.

Add the option lineSep to the text() function.

C.

Add the option wholetext=False to the text() function.

D.

Add the option lineSep=", " to the text() function.
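The two reading modes can be contrasted with a small plain-Python model: one record per line versus one record per file, each paired with the file's path. The helper names and file contents are hypothetical; in PySpark the per-file behaviour corresponds to the wholetext parameter of spark.read.text.

```python
import tempfile
from pathlib import Path


def read_lines(directory):
    """One (text, path) record per line, like the default text reader."""
    return [
        (line, str(path))
        for path in sorted(Path(directory).iterdir())
        for line in path.read_text().splitlines()
    ]


def read_whole(directory):
    """One (text, path) record per file, modelling wholetext=True."""
    return [
        (path.read_text(), str(path))
        for path in sorted(Path(directory).iterdir())
    ]


with tempfile.TemporaryDirectory() as tmp:
    # Made-up files standing in for the raw text directory.
    Path(tmp, "a.txt").write_text("first\nsecond\n")
    Path(tmp, "b.txt").write_text("only line\n")
    per_line = read_lines(tmp)
    per_file = read_whole(tmp)

print(len(per_line), len(per_file))
```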

Given:

spark.sparkContext.setLogLevel("<LOG_LEVEL>")

Which set contains the suitable configuration settings for Spark driver LOG_LEVELs?

A.

ALL, DEBUG, FAIL, INFO

B.

ERROR, WARN, TRACE, OFF

C.

WARN, NONE, ERROR, FATAL

D.

FATAL, NONE, INFO, DEBUG
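The level names that setLogLevel accepts, per the Spark documentation, are ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, and WARN. A small plain-Python check (the helper name is hypothetical) shows which of the listed sets contains only valid names.

```python
# Level names documented for SparkContext.setLogLevel.
VALID_LOG_LEVELS = {"ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN"}


def all_valid(levels):
    """True when every name in the list is a documented log level."""
    return all(level in VALID_LOG_LEVELS for level in levels)


# The four answer sets, copied from the question.
options = {
    "A": ["ALL", "DEBUG", "FAIL", "INFO"],
    "B": ["ERROR", "WARN", "TRACE", "OFF"],
    "C": ["WARN", "NONE", "ERROR", "FATAL"],
    "D": ["FATAL", "NONE", "INFO", "DEBUG"],
}

valid_options = [name for name, levels in options.items() if all_valid(levels)]
print(valid_options)
```

FAIL and NONE are not level names, so any set containing them is ruled out.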

A Data Analyst needs to retrieve employees with 5 or more years of tenure.

Which code snippet filters and shows the list?

A.

employees_df.filter(employees_df.tenure >= 5).show()

B.

employees_df.where(employees_df.tenure >= 5)

C.

filter(employees_df.tenure >= 5)

D.

employees_df.filter(employees_df.tenure >= 5).collect()