Which of the following code blocks returns a one-column DataFrame for which every row contains an array of all integer numbers from 0 up to and including the number given in column predError of DataFrame transactionsDf, and null if predError is null?
Sample of DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+
A.
    def count_to_target(target):
        if target is None:
            return

        result = [range(target)]
        return result

    count_to_target_udf = udf(count_to_target, ArrayType[IntegerType])

    transactionsDf.select(count_to_target_udf(col('predError')))
B.
    def count_to_target(target):
        if target is None:
            return

        result = list(range(target))
        return result

    transactionsDf.select(count_to_target(col('predError')))
C. (Correct)
    def count_to_target(target):
        if target is None:
            return

        result = list(range(target))
        return result

    count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))

    transactionsDf.select(count_to_target_udf('predError'))
D.
    def count_to_target(target):
        result = list(range(target))
        return result

    count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))

    df = transactionsDf.select(count_to_target_udf('predError'))
E.
    def count_to_target(target):
        if target is None:
            return

        result = list(range(target))
        return result

    count_to_target_udf = udf(count_to_target)

    transactionsDf.select(count_to_target_udf('predError'))
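The pattern these options probe (a null-safe Python function wrapped as a UDF with an explicit array return type) can be sketched as follows. This is a minimal sketch, assuming PySpark is installed; the helper name make_count_to_target_udf is hypothetical, and its imports are deferred so that the plain-Python part can be checked without Spark. Note that range(target) stops at target - 1.

```python
def count_to_target(target):
    """Return [0, 1, ..., target - 1], or None when target is None."""
    if target is None:
        # Returning None makes Spark emit null for this row.
        return None
    return list(range(target))


def make_count_to_target_udf():
    """Wrap the function as a Spark UDF (hypothetical helper; needs PySpark)."""
    # Imports are deferred so the pure-Python function above works without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, IntegerType

    # The return type must be an instantiated DataType, hence the parentheses.
    return udf(count_to_target, ArrayType(IntegerType()))
```

With a live Spark session, the returned UDF could be applied as transactionsDf.select(make_count_to_target_udf()('predError')).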
Which of the following statements about Spark's configuration properties is incorrect?
A. The maximum number of tasks that an executor can process at the same time is controlled by the spark.task.cpus property.
B. The maximum number of tasks that an executor can process at the same time is controlled by the spark.executor.cores property.
C. The default value for spark.sql.autoBroadcastJoinThreshold is 10MB.
D. The default number of partitions to use when shuffling data for joins or aggregations is 300.
E. The default number of partitions returned from certain transformations can be controlled by the spark.default.parallelism property.
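How spark.executor.cores and spark.task.cpus interact can be illustrated with a little arithmetic. This is a minimal sketch in plain Python, not a Spark API; the function name max_concurrent_tasks is hypothetical, and the default of 1 mirrors the default value of spark.task.cpus.

```python
def max_concurrent_tasks(executor_cores: int, task_cpus: int = 1) -> int:
    """Task slots per executor: cores available divided by cores per task."""
    if executor_cores < 1 or task_cpus < 1:
        raise ValueError("both settings must be positive integers")
    return executor_cores // task_cpus


print(max_concurrent_tasks(4))     # 4
print(max_concurrent_tasks(4, 2))  # 2
```

Both properties therefore affect how many tasks an executor can run at once.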
Which of the following code blocks reads in the JSON file stored at filePath as a DataFrame?
A. spark.read.json(filePath)
B. spark.read.path(filePath, source="json")
C. spark.read().path(filePath)
D. spark.read().json(filePath)
E. spark.read.path(filePath)
Which of the following statements about DAGs is correct?
A. DAGs help direct how Spark executors process tasks, but are a limitation to the proper execution of a query when an executor fails.
B. DAG stands for "Directing Acyclic Graph".
C. Spark strategically hides DAGs from developers, since the high degree of automation in Spark means that developers never need to consider DAG layouts.
D. In contrast to transformations, DAGs are never lazily executed.
E. DAGs can be decomposed into tasks that are executed in parallel.
Which of the following code blocks reads in the two-partition parquet file stored at filePath, making sure all columns are included exactly once even though each partition has a different schema?
Schema of first partition:
root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- productId: integer (nullable = true)
 |-- f: integer (nullable = true)
Schema of second partition:
root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- rollId: integer (nullable = true)
 |-- f: integer (nullable = true)
 |-- tax_id: integer (nullable = false)
A. spark.read.parquet(filePath, mergeSchema='y')
B. spark.read.option("mergeSchema", "true").parquet(filePath)
C. spark.read.parquet(filePath)
D.
    nx = 0
    for file in dbutils.fs.ls(filePath):
        if not file.name.endswith(".parquet"):
            continue
        df_temp = spark.read.parquet(file.path)
        if nx == 0:
            df = df_temp
        else:
            df = df.union(df_temp)
        nx = nx+1
    df
E.
    nx = 0
    for file in dbutils.fs.ls(filePath):
        if not file.name.endswith(".parquet"):
            continue
        df_temp = spark.read.parquet(file.path)
        if nx == 0:
            df = df_temp
        else:
            df = df.join(df_temp, how="outer")
        nx = nx+1
    df
Which of the following statements about RDDs is incorrect?
A. An RDD consists of a single partition.
B. The high-level DataFrame API is built on top of the low-level RDD API.
C. RDDs are immutable.
D. RDD stands for Resilient Distributed Dataset.
E. RDDs are great for precisely instructing Spark on how to do a query.
Which of the following describes the role of the cluster manager?
A. The cluster manager schedules tasks on the cluster in client mode.
B. The cluster manager schedules tasks on the cluster in local mode.
C. The cluster manager allocates resources to Spark applications and maintains the executor processes in client mode.
D. The cluster manager allocates resources to Spark applications and maintains the executor processes in remote mode.
E. The cluster manager allocates resources to the DataFrame manager.
Which of the following code blocks reads the parquet file stored at filePath into DataFrame itemsDf, using a valid schema for the sample of itemsDf shown below?
Sample of itemsDf:
+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+
A.
    itemsDfSchema = StructType([
        StructField("itemId", IntegerType()),
        StructField("attributes", StringType()),
        StructField("supplier", StringType())])

    itemsDf = spark.read.schema(itemsDfSchema).parquet(filePath)
B.
    itemsDfSchema = StructType([
        StructField("itemId", IntegerType),
        StructField("attributes", ArrayType(StringType)),
        StructField("supplier", StringType)])

    itemsDf = spark.read.schema(itemsDfSchema).parquet(filePath)
C. itemsDf = spark.read.schema('itemId integer, attributes
D.
    itemsDfSchema = StructType([
        StructField("itemId", IntegerType()),
        StructField("attributes", ArrayType(StringType())),
        StructField("supplier", StringType())])

    itemsDf = spark.read.schema(itemsDfSchema).parquet(filePath)
E.
    itemsDfSchema = StructType([
        StructField("itemId", IntegerType()),
        StructField("attributes", ArrayType([StringType()])),
        StructField("supplier", StringType())])

    itemsDf = spark.read(schema=itemsDfSchema).parquet(filePath)
Which of the following code blocks returns a DataFrame with approximately 1,000 rows from the 10,000-row DataFrame itemsDf, without any duplicates, returning the same rows even if the code block is run twice?
A. itemsDf.sampleBy("row", fractions={0: 0.1}, seed=82371)
B. itemsDf.sample(fraction=0.1, seed=87238)
C. itemsDf.sample(fraction=1000, seed=98263)
D. itemsDf.sample(withReplacement=True, fraction=0.1, seed=23536)
E. itemsDf.sample(fraction=0.1)
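Why a fraction-based sample without replacement gives "approximately" rather than exactly 1,000 rows, and why a seed makes it repeatable, can be seen in a plain-Python analogy. This is a minimal sketch of Bernoulli sampling, not Spark's implementation; the function name bernoulli_sample is hypothetical.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed -> same rows on every run
    return [row for row in rows if rng.random() < fraction]


rows = list(range(10_000))
first = bernoulli_sample(rows, fraction=0.1, seed=87238)
second = bernoulli_sample(rows, fraction=0.1, seed=87238)

print(len(first))       # close to 1,000, but not guaranteed to be exactly 1,000
print(first == second)  # True: the seed makes the sample repeatable
```

Because each row is considered only once, no row can appear twice; sampling with replacement lifts that guarantee.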
The code block displayed below contains an error. The code block should save DataFrame transactionsDf at path path as a parquet file, appending to any existing parquet file. Find the error.
Code block:
transactionsDf.format("parquet").option("mode", "append").save(path)
A. The code block is missing a reference to the DataFrameWriter.
B. save() is evaluated lazily and needs to be followed by an action.
C. The mode option should be omitted so that the command uses the default mode.
D. The code block is missing a bucketBy command that takes care of partitions.
E. Given that the DataFrame should be saved as a parquet file, path is being passed to the wrong method.
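For comparison, an appending parquet write usually goes through the DataFrameWriter that DataFrame.write returns. The following is a minimal sketch, assuming df is a PySpark DataFrame; the helper name append_as_parquet is hypothetical.

```python
def append_as_parquet(df, path):
    """Append df to the parquet data at path (hypothetical helper)."""
    # df.write returns the DataFrameWriter; the save mode is set with .mode(),
    # and the write is triggered as soon as save() is called.
    df.write.format("parquet").mode("append").save(path)
```

With a live session this would be called as append_as_parquet(transactionsDf, path).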