
Which of the following code blocks returns a DataFrame with an added column to DataFrame transactionsDf that shows the unix epoch timestamps in column transactionDate as strings in the format month/day/year in column transactionDateFormatted?

Excerpt of DataFrame transactionsDf:

A.

transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="dd/MM/yyyy"))

B.

transactionsDf.withColumnRenamed("transactionDate", "transactionDateFormatted", from_unixtime("transactionDateFormatted", format="MM/dd/yyyy"))

C.

transactionsDf.apply(from_unixtime(format="MM/dd/yyyy")).asColumn("transactionDateFormatted")

D.

transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="MM/dd/yyyy"))

E.

transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate"))

The code block displayed below contains an error. The code block should configure Spark so that DataFrames up to a size of 20 MB will be broadcast to all worker nodes when performing a join.

Find the error.

Code block:

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 20)

A.

Spark will only broadcast DataFrames that are much smaller than the default value.

B.

The correct option to write configurations is through spark.config and not spark.conf.

C.

Spark will only apply the limit to threshold joins and not to other joins.

D.

The passed limit has the wrong variable type.

E.

The command is evaluated lazily and needs to be followed by an action.
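The threshold is interpreted in bytes, not megabytes, so passing 20 sets a roughly 20-byte limit, far below the default. A sketch of the intended configuration (assuming an active SparkSession named spark):

```python
# spark.sql.autoBroadcastJoinThreshold is read in bytes,
# so 20 MB must be spelled out explicitly
threshold_bytes = 20 * 1024 * 1024  # 20 MB

# assuming an active SparkSession named `spark`:
# spark.conf.set("spark.sql.autoBroadcastJoinThreshold", threshold_bytes)
```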

The code block displayed below contains an error. The code block should count the number of rows that have a predError of either 3 or 6. Find the error.

Code block:

transactionsDf.filter(col('predError').in([3, 6])).count()

A.

The number of rows cannot be determined with the count() operator.

B.

Instead of filter, the select method should be used.

C.

The method used on column predError is incorrect.

D.

Instead of a list, the values need to be passed as single arguments to the in operator.

E.

Numbers 3 and 6 need to be passed as string variables.

Which of the elements in the labeled panels represent the operation performed for broadcast variables?


A.

2, 5

B.

3

C.

2, 3

D.

1, 2

E.

1, 3, 4

The code block displayed below contains an error. The code block is intended to join DataFrame itemsDf with the larger DataFrame transactionsDf on column itemId. Find the error.

Code block:

transactionsDf.join(itemsDf, "itemId", how="broadcast")

A.

The syntax is wrong, how= should be removed from the code block.

B.

The join method should be replaced by the broadcast method.

C.

Spark will only perform the broadcast operation if this behavior has been enabled on the Spark cluster.

D.

The larger DataFrame transactionsDf is being broadcasted, rather than the smaller DataFrame itemsDf.

E.

broadcast is not a valid join type.

Which of the following code blocks returns a DataFrame that has all columns of DataFrame transactionsDf and an additional column predErrorSquared which is the squared value of column predError in DataFrame transactionsDf?

A.

transactionsDf.withColumn("predError", pow(col("predErrorSquared"), 2))

B.

transactionsDf.withColumnRenamed("predErrorSquared", pow(predError, 2))

C.

transactionsDf.withColumn("predErrorSquared", pow(col("predError"), lit(2)))

D.

transactionsDf.withColumn("predErrorSquared", pow(predError, lit(2)))

E.

transactionsDf.withColumn("predErrorSquared", "predError"**2)

Which of the following code blocks returns a new DataFrame with only columns predError and value of every second row of DataFrame transactionsDf?

Entire DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+

A.

transactionsDf.filter(col("transactionId").isin([3,4,6])).select([predError, value])

B.

transactionsDf.select(col("transactionId").isin([3,4,6]), "predError", "value")

C.

transactionsDf.filter("transactionId" % 2 == 0).select("predError", "value")

D.

transactionsDf.filter(col("transactionId") % 2 == 0).select("predError", "value")

(Correct)

E.

transactionsDf.createOrReplaceTempView("transactionsDf")

spark.sql("FROM transactionsDf SELECT predError, value WHERE transactionId % 2 = 2")

F.

transactionsDf.filter(col(transactionId).isin([3,4,6]))

Which of the following code blocks returns a single-row DataFrame that only has a column corr which shows the Pearson correlation coefficient between columns predError and value in DataFrame transactionsDf?

A.

transactionsDf.select(corr(["predError", "value"]).alias("corr")).first()

B.

transactionsDf.select(corr(col("predError"), col("value")).alias("corr")).first()

C.

transactionsDf.select(corr(predError, value).alias("corr"))

D.

transactionsDf.select(corr(col("predError"), col("value")).alias("corr"))

(Correct)

E.

transactionsDf.select(corr("predError", "value"))

Which of the following statements about storage levels is incorrect?

A.

The cache operator on DataFrames is evaluated like a transformation.

B.

In client mode, DataFrames cached with the MEMORY_ONLY_2 level will not be stored in the edge node's memory.

C.

Caching can be undone using the DataFrame.unpersist() operator.

D.

MEMORY_AND_DISK replicates cached DataFrames both on memory and disk.

E.

DISK_ONLY will not use the worker node's memory.

The code block displayed below contains an error. The code block should produce a DataFrame with color as the only column and three rows with color values of red, blue, and green, respectively.

Find the error.

Code block:

spark.createDataFrame([("red",), ("blue",), ("green",)], "color")

A.

Instead of calling spark.createDataFrame, just DataFrame should be called.

B.

The commas in the tuples with the colors should be eliminated.

C.

The colors red, blue, and green should be expressed as a simple Python list, and not a list of tuples.

D.

Instead of color, a data type should be specified.

E.

The "color" expression needs to be wrapped in brackets, so it reads ["color"].