Pay attention to our Valid and Useful Exam Reviews and take our Exam Torrent as your Study Material. With little time and energy investment, you have a High Efficiency Study experience. Pass your Actual Test with the help of our Actual Reviews.

Latest Apr-2023 Databricks Associate-Developer-Apache-Spark Dumps Updated 179 Questions [Q51-Q70]

PDF Download Free of Associate-Developer-Apache-Spark Valid Practice Test Questions


The best way to study for a Databricks Associate Developer Apache Spark Exam is by getting as many

Many of the questions you will face when taking the Databricks Associate Developer exam are based on real-world scenarios that can only be simulated in the Databricks environment. Our team of subject matter experts has designed a series of practice exams that will help you prepare for this exam. With our online practice exams, you can simulate the actual Databricks environment and learn from your mistakes while working your way through the questions. Databricks Associate Developer Apache Spark exam dumps will save you time and money.

We developed the online test platform because we wanted to make sure that you could practice on your own schedule. You can take the test anytime, and you can retake it as many times as you like.

In conclusion, the best way to learn something is to practice it. If you're a beginner, it's recommended that you start with the free practice exams available on our website. Once you've mastered the fundamentals, you can move on to the official Databricks Associate Developer Apache Spark exam prep materials. They come with an accompanying practice test. You'll get the chance to test your knowledge before the actual exam. This will help you know if you have what it takes to pass the real exam. If you do, you can skip the official exam prep materials and focus on learning the concepts covered in the practice test.


What is the Databricks Associate Developer Apache Spark Exam?

The Databricks Associate Developer Apache Spark Exam is a certification that can be earned by anyone who has successfully completed the Databricks Associate Developer Apache Spark Certification Training. The exam covers all the material that was covered in the training. The exam is designed to test your knowledge of the concepts, skills, and abilities that you learned during the course.

Do you want to become a Data Engineer or a Spark Architect? If so, then the Databricks Associate Developer Apache Spark Exam is a must-pass. The Databricks Associate Developer Apache Spark Exam is designed to help you develop a complete understanding of the technology used by the Databricks platform. You will learn about the basics of Spark, including the Spark programming model, Spark SQL, Spark Streaming, and the Spark ecosystem. Databricks Associate Developer Apache Spark exam dumps are the choice of champions.

The Databricks Associate Developer Apache Spark Exam is a test that aims to assess whether you have the knowledge required to become a certified Apache Spark developer. The Databricks Associate Developer Apache Spark Exam consists of two parts: the first part tests your knowledge of the fundamentals of the Apache Spark framework and the second part tests your ability to apply this knowledge. This post will help you get a head start in preparing for the Databricks Associate Developer Apache Spark Exam.


What does the Databricks Associate Developer Apache Spark Exam cost?

The cost of the Databricks Associate Developer Apache Spark Exam is 200 USD per attempt.

 

NEW QUESTION 51
The code block shown below should write DataFrame transactionsDf as a parquet file to path storeDir, using brotli compression and replacing any previously existing file. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__.format("parquet").__2__(__3__).option(__4__, "brotli").__5__(storeDir)

  • A. 1. write
    2. mode
    3. "overwrite"
    4. "compression"
    5. save
    (Correct)
  • B. 1. save
    2. mode
    3. "replace"
    4. "compression"
    5. path
  • C. 1. save
    2. mode
    3. "ignore"
    4. "compression"
    5. path
  • D. 1. write
    2. mode
    3. "overwrite"
    4. compression
    5. parquet
  • E. 1. store
    2. with
    3. "replacement"
    4. "compression"
    5. path

Answer: A

Explanation:
Correct code block:
transactionsDf.write.format("parquet").mode("overwrite").option("compression", "brotli").save(storeDir)
Solving this question requires you to know how to access the DataFrameWriter (link below) from the DataFrame API - through DataFrame.write.
Another nuance here is knowing the different modes available for writing parquet files, which determine Spark's behavior when dealing with existing files. These, together with the compression options, are explained in the DataFrameWriter.parquet documentation linked below.
Finally, blank __5__ poses a certain challenge. You need to know which command you can use to pass the file path to the DataFrameWriter. Both save and parquet are valid options here. The option that uses parquet still fails, because it passes compression without quotes.
More info:
- DataFrame.write: pyspark.sql.DataFrame.write - PySpark 3.1.1 documentation
- DataFrameWriter.parquet: pyspark.sql.DataFrameWriter.parquet - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1
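
The writer chain from this question can be wrapped in a small helper. This is only a sketch: it assumes df is a PySpark DataFrame, and the function name write_parquet_overwrite and its codec parameter are hypothetical names chosen for illustration. Whether the brotli codec is actually usable depends on the codecs installed on your cluster.

```python
def write_parquet_overwrite(df, store_dir, codec="brotli"):
    """Write df as parquet to store_dir, replacing any existing output.

    df is assumed to be a pyspark.sql.DataFrame; the helper name and the
    codec parameter are illustrative, not part of the PySpark API.
    """
    (
        df.write                        # DataFrame.write is an attribute that returns a DataFrameWriter
        .format("parquet")
        .mode("overwrite")              # replace previously existing files
        .option("compression", codec)   # the option key must be a string
        .save(store_dir)
    )

# Usage, given an existing DataFrame and target path:
# write_parquet_overwrite(transactionsDf, storeDir)
```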

 

NEW QUESTION 52
The code block shown below should return a copy of DataFrame transactionsDf without columns value and productId and with an additional column associateId that has the value 5. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__, __3__).__4__(__5__, 'value')

  • A. 1. withColumn
    2. col(associateId)
    3. lit(5)
    4. drop
    5. col(productId)
  • B. 1. withColumn
    2. 'associateId'
    3. 5
    4. remove
    5. 'productId'
  • C. 1. withNewColumn
    2. associateId
    3. lit(5)
    4. drop
    5. productId
  • D. 1. withColumn
    2. 'associateId'
    3. lit(5)
    4. drop
    5. 'productId'
  • E. 1. withColumnRenamed
    2. 'associateId'
    3. 5
    4. drop
    5. 'productId'

Answer: D

Explanation:
Correct code block:
transactionsDf.withColumn('associateId', lit(5)).drop('productId', 'value')
To solve this question, it is important that you know the lit() function (link to documentation below). This function enables you to add a column of a constant value to a DataFrame.
More info: pyspark.sql.functions.lit - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1

 

NEW QUESTION 53
The code block displayed below contains an error. The code block is intended to write DataFrame transactionsDf to disk as a parquet file in location /FileStore/transactions_split, using column storeId as key for partitioning. Find the error.
Code block:
transactionsDf.write.format("parquet").partitionOn("storeId").save("/FileStore/transactions_split")

  • A. Partitioning data by storeId is possible with the bucketBy expression, so partitionOn should be replaced by bucketBy.
  • B. Partitioning data by storeId is possible with the partitionBy expression, so partitionOn should be replaced by partitionBy.
  • C. The format("parquet") expression should be removed and instead, the information should be added to the write expression like so: write("parquet").
  • D. partitionOn("storeId") should be called before the write operation.
  • E. The format("parquet") expression is inappropriate to use here, "parquet" should be passed as first argument to the save() operator and "/FileStore/transactions_split" as the second argument.

Answer: B

Explanation:
Correct code block:
transactionsDf.write.format("parquet").partitionBy("storeId").save("/FileStore/transactions_split")
More info: partition by - Reading files which are written using PartitionBy or BucketBy in Spark - Stack Overflow
Static notebook | Dynamic notebook: See test 1

 

NEW QUESTION 54
Which of the following code blocks returns all unique values of column storeId in DataFrame transactionsDf?

  • A. transactionsDf.select(col("storeId").distinct())
  • B. transactionsDf.distinct("storeId")
  • C. transactionsDf.filter("storeId").distinct()
  • D. transactionsDf.select("storeId").distinct()
    (Correct)
  • E. transactionsDf["storeId"].distinct()

Answer: D

Explanation:
distinct() is a method of a DataFrame. Knowing this, or recognizing this from the documentation, is the key to solving this question.
More info: pyspark.sql.DataFrame.distinct - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2

 

NEW QUESTION 55
Which of the following code blocks reorders the values inside the arrays in column attributes of DataFrame itemsDf from last to first one in the alphabet?
+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+

  • A. itemsDf.withColumn('attributes', sort_array(desc('attributes')))
  • B. itemsDf.select(sort_array("attributes"))
  • C. itemsDf.withColumn("attributes", sort_array("attributes", asc=False))
  • D. itemsDf.withColumn('attributes', sort_array(col('attributes').desc()))
  • E. itemsDf.withColumn('attributes', sort(col('attributes'), asc=False))

Answer: C

Explanation:
Output of correct code block:
+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[winter, cozy, blue]         |Sports Company Inc.|
|2     |[summer, red, fresh, cooling]|YetiX              |
|3     |[travel, summer, green]      |Sports Company Inc.|
+------+-----------------------------+-------------------+
It can be confusing to differentiate between the different sorting functions in PySpark. In this case, a particularity of sort_array has to be considered: the sort direction is given by the second argument, not by the desc method. Luckily, this is covered in the documentation (link below). Also, to solve this question you need to understand the difference between sort and sort_array. With sort, you cannot sort values in arrays. In addition, sort is a method of DataFrame, while sort_array is a function in pyspark.sql.functions.
More info: pyspark.sql.functions.sort_array - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
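
To see what a descending per-row array sort does to the sample data, here is a plain-Python model. It is not Spark code: it only mimics the effect of sort_array("attributes", asc=False) on each row, using a hypothetical helper named sort_attributes_desc and the three rows from the question.

```python
# Rows from the question as (itemId, attributes, supplier) tuples
items = [
    (1, ["blue", "winter", "cozy"], "Sports Company Inc."),
    (2, ["red", "summer", "fresh", "cooling"], "YetiX"),
    (3, ["green", "summer", "travel"], "Sports Company Inc."),
]

def sort_attributes_desc(rows):
    """Sort the array inside each row in descending order; row order is untouched."""
    return [
        (item_id, sorted(attributes, reverse=True), supplier)
        for item_id, attributes, supplier in rows
    ]

for row in sort_attributes_desc(items):
    print(row)
```

The point of the model is that the sort happens inside each array, independently per row, which is different from ordering the rows of the DataFrame.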

 

NEW QUESTION 56
The code block displayed below contains an error. The code block is intended to perform an outer join of DataFrames transactionsDf and itemsDf on columns productId and itemId, respectively.
Find the error.
Code block:
transactionsDf.join(itemsDf, [itemsDf.itemId, transactionsDf.productId], "outer")

  • A. The "outer" argument should be eliminated from the call and join should be replaced by joinOuter.
  • B. The term [itemsDf.itemId, transactionsDf.productId] should be replaced by itemsDf.col("itemId") == transactionsDf.col("productId").
  • C. The join type needs to be appended to the join() operator, like join().outer() instead of listing it as the last argument inside the join() call.
  • D. The "outer" argument should be eliminated, since "outer" is the default join type.
  • E. The term [itemsDf.itemId, transactionsDf.productId] should be replaced by itemsDf.itemId == transactionsDf.productId.

Answer: E

Explanation:
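Correct code block:
transactionsDf.join(itemsDf, itemsDf.itemId == transactionsDf.productId, "outer")
A list of two Column objects does not state how the columns relate to each other. The join condition has to be a boolean Column expression such as itemsDf.itemId == transactionsDf.productId.
Static notebook | Dynamic notebook: See test 1 (https://flrs.github.io/spark_practice_tests_code/#1/33.html, https://bit.ly/sparkpracticeexams_import_instructions)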

 

NEW QUESTION 57
The code block shown below should return only the average prediction error (column predError) of a random subset, without replacement, of approximately 15% of rows in DataFrame transactionsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__, __3__).__4__(avg('predError'))

  • A. 1. sample
    2. False
    3. 0.15
    4. select
  • B. 1. sample
    2. True
    3. 0.15
    4. filter
  • C. 1. fraction
    2. False
    3. 0.85
    4. select
  • D. 1. sample
    2. 0.85
    3. False
    4. select
  • E. 1. fraction
    2. 0.15
    3. True
    4. where

Answer: A

Explanation:
Correct code block:
transactionsDf.sample(withReplacement=False, fraction=0.15).select(avg('predError'))
You should remember that getting a random subset of rows means sampling. This, in turn, should point you to the DataFrame.sample() method. Once you know this, you can look up the correct order of arguments in the documentation (link below).
Lastly, you have to decide whether to use filter, where or select. where is just an alias for filter(). filter() is not the correct method to use here, since it would only allow you to filter rows based on some condition. However, the question asks to return only the average prediction error. You can control the columns that a query returns with the select() method - so this is the correct method to use here.
More info: pyspark.sql.DataFrame.sample - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
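
The question says "approximately 15%" because sampling without replacement keeps each row independently with the given probability, so the subset size varies from run to run. The plain-Python sketch below models only that per-row coin flip. It is not Spark's sampler, and the helper name bernoulli_sample and the made-up pred_errors data are assumptions for illustration.

```python
import random
from statistics import mean

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction` (no row is kept twice)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

# Made-up prediction errors standing in for column predError
pred_errors = [float(i % 7) for i in range(10_000)]

subset = bernoulli_sample(pred_errors, fraction=0.15, seed=42)
print(len(subset))    # close to 1,500, but usually not exactly 1,500
print(mean(subset))   # average prediction error of the sampled rows only
```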

 

NEW QUESTION 58
Which of the following are valid execution modes?

  • A. Client, Cluster, Local
  • B. Server, Standalone, Client
  • C. Standalone, Client, Cluster
  • D. Kubernetes, Local, Client
  • E. Cluster, Server, Local

Answer: A

Explanation:
This is a tricky question to get right, since it is easy to confuse execution modes and deployment modes. Even in literature, both terms are sometimes used interchangeably.
There are only 3 valid execution modes in Spark: Client, cluster, and local execution modes. Execution modes do not refer to specific frameworks, but to where infrastructure is located with respect to each other.
In client mode, the driver sits on a machine outside the cluster. In cluster mode, the driver sits on a machine inside the cluster. Finally, in local mode, all Spark infrastructure is started in a single JVM (Java Virtual Machine) in a single computer which then also includes the driver.
Deployment modes often refer to ways that Spark can be deployed in cluster mode and how it uses specific frameworks outside Spark. Valid deployment modes are standalone, Apache YARN, Apache Mesos and Kubernetes.
Client, Cluster, Local
Correct, all of these are the valid execution modes in Spark.
Standalone, Client, Cluster
No, standalone is not a valid execution mode. It is a valid deployment mode, though.
Kubernetes, Local, Client
No, Kubernetes is a deployment mode, but not an execution mode.
Cluster, Server, Local
No, Server is not an execution mode.
Server, Standalone, Client
No, standalone and server are not execution modes.
More info: Apache Spark Internals - Learning Journal

 

NEW QUESTION 59
Which of the following code blocks reads all CSV files in directory filePath into a single DataFrame, with column names defined in the CSV file headers?
Content of directory filePath:
_SUCCESS
_committed_2754546451699747124
_started_2754546451699747124
part-00000-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-298-1-c000.csv.gz
part-00001-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-299-1-c000.csv.gz
part-00002-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-300-1-c000.csv.gz
part-00003-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-301-1-c000.csv.gz

  • A. spark.read.load(filePath)
  • B. spark.read().option("header",True).load(filePath)
  • C. spark.read.format("csv").option("header",True).load(filePath)
  • D. spark.read.format("csv").option("header",True).option("compression","zip").load(filePath)
  • E. spark.option("header",True).csv(filePath)

Answer: C

Explanation:
The files in directory filePath are partitions of a DataFrame that have been exported using gzip compression.
Spark automatically recognizes this situation and imports the CSV files as separate partitions into a single DataFrame. It is, however, necessary to specify that Spark should load the file headers in the CSV with the header option, which is set to False by default.
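
As a sketch, the read described above can be wrapped in a helper. It assumes spark is an existing SparkSession; the name read_csv_with_header is hypothetical.

```python
def read_csv_with_header(spark, path):
    """Load all CSV files under `path` into one DataFrame, taking column names from the header row.

    `spark` is assumed to be a SparkSession; the helper name is illustrative only.
    """
    return (
        spark.read                 # an attribute returning a DataFrameReader, not a method call
        .format("csv")
        .option("header", True)    # header is False unless set explicitly
        .load(path)                # a directory path picks up all data files in it
    )

# Usage, given an existing session and directory:
# df = read_csv_with_header(spark, filePath)
```

No compression option is set here, in line with the explanation that Spark recognizes the gzip-compressed files on its own.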

 

NEW QUESTION 60
The code block displayed below contains an error. The code block should combine data from DataFrames itemsDf and transactionsDf, showing all rows of DataFrame itemsDf that have a matching value in column itemId with a value in column transactionId of DataFrame transactionsDf. Find the error.
Code block:
itemsDf.join(itemsDf.itemId==transactionsDf.transactionId)

  • A. The merge method should be used instead of join.
  • B. The union method should be used instead of join.
  • C. The join statement is incomplete.
  • D. The join method is inappropriate.
  • E. The join expression is malformed.

Answer: C

Explanation:
Correct code block:
itemsDf.join(transactionsDf, itemsDf.itemId==transactionsDf.transactionId)
The join statement is incomplete.
Correct! If you look at the documentation of DataFrame.join() (linked below), you see that the very first argument of join should be the DataFrame that should be joined with. This first argument is missing in the code block.
The join method is inappropriate.
No. By default, DataFrame.join() uses an inner join. This method is appropriate for the scenario described in the question.
The join expression is malformed.
Incorrect. The join expression itemsDf.itemId==transactionsDf.transactionId is correct syntax.
The merge method should be used instead of join.
False. There is no DataFrame.merge() method in PySpark.
The union method should be used instead of join.
Wrong. DataFrame.union() merges rows, but not columns as requested in the question.
More info: pyspark.sql.DataFrame.join - PySpark 3.1.2 documentation, pyspark.sql.DataFrame.union - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3

 

NEW QUESTION 61
Which of the following describes Spark actions?

  • A. Stage boundaries are commonly established by actions.
  • B. The driver receives data upon request by actions.
  • C. Writing data to disk is the primary purpose of actions.
  • D. Actions are Spark's way of modifying RDDs.
  • E. Actions are Spark's way of exchanging data between executors.

Answer: B

Explanation:
The driver receives data upon request by actions.
Correct! Actions trigger the distributed execution of tasks on executors which, upon task completion, transfer result data back to the driver.
Actions are Spark's way of exchanging data between executors.
No. In Spark, data is exchanged between executors via shuffles.
Writing data to disk is the primary purpose of actions.
No. The primary purpose of actions is to access data that is stored in Spark's RDDs and return the data, often in aggregated form, back to the driver.
Actions are Spark's way of modifying RDDs.
Incorrect. Firstly, RDDs are immutable - they cannot be modified. Secondly, Spark generates new RDDs via transformations and not actions.
Stage boundaries are commonly established by actions.
Wrong. A stage boundary is commonly established by a shuffle, for example caused by a wide transformation.

 

NEW QUESTION 62
Which of the following code blocks returns a DataFrame showing the mean value of column "value" of DataFrame transactionsDf, grouped by its column storeId?

  • A. transactionsDf.groupBy(col(storeId).avg())
  • B. transactionsDf.groupBy("storeId").agg(avg("value"))
  • C. transactionsDf.groupBy("storeId").agg(average("value"))
  • D. transactionsDf.groupBy("storeId").avg(col("value"))
  • E. transactionsDf.groupBy("value").average()

Answer: B

Explanation:
This question tests your knowledge about how to use the groupBy and agg pattern in Spark. Using the documentation, you can find out that there is no average() method in pyspark.sql.functions.
Static notebook | Dynamic notebook: See test 2

 

NEW QUESTION 63
Which of the following code blocks performs an inner join of DataFrames transactionsDf and itemsDf on columns productId and itemId, respectively, excluding columns value and storeId from DataFrame transactionsDf and column attributes from DataFrame itemsDf?

  • A. transactionsDf.drop('value', 'storeId').join(itemsDf.select('attributes'), transactionsDf.productId==itemsDf.itemId)
  • B. transactionsDf.createOrReplaceTempView('transactionsDf')
    itemsDf.createOrReplaceTempView('itemsDf')
    statement = """
    SELECT * FROM transactionsDf
    INNER JOIN itemsDf
    ON transactionsDf.productId==itemsDf.itemId
    """
    spark.sql(statement).drop("value", "storeId", "attributes")
  • C. transactionsDf.createOrReplaceTempView('transactionsDf')
    itemsDf.createOrReplaceTempView('itemsDf')
    spark.sql("SELECT -value, -storeId FROM transactionsDf INNER JOIN itemsDf ON productId==itemId").drop("attributes")
  • D. transactionsDf.drop("value", "storeId").join(itemsDf.drop("attributes"),
    "transactionsDf.productId==itemsDf.itemId")
  • E. transactionsDf \
    .drop(col('value'), col('storeId')) \
    .join(itemsDf.drop(col('attributes')), col('productId')==col('itemId'))

Answer: B

Explanation:
This question offers you a wide variety of answers for a seemingly simple question. However, this variety reflects the variety of ways that one can express a join in PySpark. You need to understand some SQL syntax to get to the correct answer here.
transactionsDf.createOrReplaceTempView('transactionsDf')
itemsDf.createOrReplaceTempView('itemsDf')
statement = """
SELECT * FROM transactionsDf
INNER JOIN itemsDf
ON transactionsDf.productId==itemsDf.itemId
"""
spark.sql(statement).drop("value", "storeId", "attributes")
Correct - this answer uses SQL correctly to perform the inner join and afterwards drops the unwanted columns. This is totally fine. If you are unfamiliar with the triple-quote """ in Python: This allows you to express strings as multiple lines.
transactionsDf \
.drop(col('value'), col('storeId')) \
.join(itemsDf.drop(col('attributes')), col('productId')==col('itemId'))
No, this answer option is a trap, since DataFrame.drop() does not accept multiple Column objects. You could use transactionsDf.drop('value', 'storeId') instead.
transactionsDf.drop("value", "storeId").join(itemsDf.drop("attributes"),
"transactionsDf.productId==itemsDf.itemId")
Incorrect - Spark does not evaluate "transactionsDf.productId==itemsDf.itemId" as a valid join expression. This would work if the expression were not a string.
transactionsDf.drop('value', 'storeId').join(itemsDf.select('attributes'), transactionsDf.productId==itemsDf.itemId)
Wrong, this statement incorrectly uses itemsDf.select instead of itemsDf.drop.
transactionsDf.createOrReplaceTempView('transactionsDf')
itemsDf.createOrReplaceTempView('itemsDf')
spark.sql("SELECT -value, -storeId FROM transactionsDf INNER JOIN itemsDf ON productId==itemId").drop("attributes")
No, here the SQL expression syntax is incorrect. Simply specifying -columnName does not drop a column.
More info: pyspark.sql.DataFrame.join - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3

 

NEW QUESTION 64
Which of the following code blocks returns a DataFrame that matches the multi-column DataFrame itemsDf, except that integer column itemId has been converted into a string column?

  • A. itemsDf.withColumn("itemId", col("itemId").convert("string"))
  • B. itemsDf.withColumn("itemId", convert("itemId", "string"))
  • C. itemsDf.withColumn("itemId", col("itemId").cast("string"))
    (Correct)
  • D. spark.cast(itemsDf, "itemId", "string")
  • E. itemsDf.select(astype("itemId", "string"))

Answer: C

Explanation:
itemsDf.withColumn("itemId", col("itemId").cast("string"))
Correct. You can convert the data type of a column using the cast method of the Column class. Also note that you will have to use the withColumn method on itemsDf for replacing the existing itemId column with the new version that contains strings.
itemsDf.withColumn("itemId", col("itemId").convert("string"))
Incorrect. The Column object that col("itemId") returns does not have a convert method.
itemsDf.withColumn("itemId", convert("itemId", "string"))
Wrong. The pyspark.sql.functions module does not have a convert function. The question is trying to mislead you by using the word "converted". Type conversion is also called "type casting". This may help you remember to look for a cast method instead of a convert method (see correct answer).
itemsDf.select(astype("itemId", "string"))
False. While astype is a method of Column (and an alias of Column.cast), it is not a method of pyspark.sql.functions (what the code block implies). In addition, the question asks to return a full DataFrame that matches the multi-column DataFrame itemsDf. Selecting just one column from itemsDf as in the code block would just return a single-column DataFrame.
spark.cast(itemsDf, "itemId", "string")
No, the Spark session (called by spark) does not have a cast method. You can find a list of all methods available for the Spark session linked in the documentation below.
More info:
- pyspark.sql.Column.cast - PySpark 3.1.2 documentation
- pyspark.sql.Column.astype - PySpark 3.1.2 documentation
- pyspark.sql.SparkSession - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3

 

NEW QUESTION 65
Which of the following statements about lazy evaluation is incorrect?

  • A. Execution is triggered by transformations.
  • B. Lineages allow Spark to coalesce transformations into stages
  • C. Predicate pushdown is a feature resulting from lazy evaluation.
  • D. Spark will fail a job only during execution, but not during definition.
  • E. Accumulators do not change the lazy evaluation model of Spark.

Answer: A

Explanation:
Note that the question asks for the incorrect statement, so the options marked Incorrect or Wrong below are true statements.
Execution is triggered by transformations.
Correct. Execution is triggered by actions only, not by transformations.
Lineages allow Spark to coalesce transformations into stages.
Incorrect. In Spark, lineage means a recording of transformations. This lineage enables lazy evaluation in Spark.
Predicate pushdown is a feature resulting from lazy evaluation.
Wrong. Predicate pushdown means that, for example, Spark will execute filters as early in the process as possible so that it deals with the least possible amount of data in subsequent transformations, resulting in a performance improvement.
Accumulators do not change the lazy evaluation model of Spark.
Incorrect. In Spark, accumulators are only updated when the query that refers to them is actually executed. In other words, they are not updated if the query is not (yet) executed due to lazy evaluation.
Spark will fail a job only during execution, but not during definition.
Wrong. During definition, due to lazy evaluation, the job is not executed and thus certain errors, for example reading from a non-existing file, cannot be caught. To be caught, the job needs to be executed, for example through an action.
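
The idea of recording transformations and running them only when an action is called can be modelled in a few lines of plain Python. This is a toy model, not Spark: the class LazyPipeline and the function missing_source are hypothetical and exist only to show why an error in the input surfaces at the action rather than at definition time.

```python
class LazyPipeline:
    """Toy model of lazy evaluation: transformations are recorded, an action runs them."""

    def __init__(self, source, lineage=()):
        self._source = source      # zero-argument callable that produces the input rows
        self._lineage = lineage    # recorded transformations, in the order they were defined

    def map(self, fn):
        # Transformation: returns a new pipeline with a longer lineage, executes nothing
        return LazyPipeline(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also only recorded
        return LazyPipeline(self._source, self._lineage + (("filter", predicate),))

    def collect(self):
        # Action: only now is the source read and the recorded lineage applied
        rows = list(self._source())
        for kind, fn in self._lineage:
            if kind == "map":
                rows = [fn(row) for row in rows]
            else:
                rows = [row for row in rows if fn(row)]
        return rows


def missing_source():
    # Stands in for an input that cannot be read
    raise FileNotFoundError("no such input")


# Defining the pipeline succeeds, because nothing is executed yet
broken = LazyPipeline(missing_source).map(lambda x: x * 2)
# Calling broken.collect() would raise FileNotFoundError at this point
```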

 

NEW QUESTION 66
Which of the following describes the characteristics of accumulators?

  • A. If an action including an accumulator fails during execution and Spark manages to restart the action and complete it successfully, only the successful attempt will be counted in the accumulator.
  • B. Accumulators can be instantiated directly via the accumulator(n) method of the pyspark.RDD module.
  • C. Accumulators are used to pass around lookup tables across the cluster.
  • D. Accumulators are immutable.
  • E. All accumulators used in a Spark application are listed in the Spark UI.

Answer: A

Explanation:
If an action including an accumulator fails during execution and Spark manages to restart the action and complete it successfully, only the successful attempt will be counted in the accumulator.
Correct, when Spark tries to rerun a failed action that includes an accumulator, it will only update the accumulator if the action succeeded.
Accumulators are immutable.
No. Although accumulators behave like write-only variables towards the executors and can only be read by the driver, they are not immutable.
All accumulators used in a Spark application are listed in the Spark UI.
Incorrect. For Scala, only named, but not unnamed, accumulators are listed in the Spark UI. For PySpark, no accumulators are listed in the Spark UI - this feature is not yet implemented.
Accumulators are used to pass around lookup tables across the cluster.
Wrong - this is what broadcast variables do.
Accumulators can be instantiated directly via the accumulator(n) method of the pyspark.RDD module.
Wrong, accumulators are instantiated via the accumulator(n) method of the sparkContext, for example: counter = spark.sparkContext.accumulator(0).
More info: python - In Spark, RDDs are immutable, then how Accumulators are implemented? - Stack Overflow, apache spark - When are accumulators truly reliable? - Stack Overflow, Spark - The Definitive Guide, Chapter 14

 

NEW QUESTION 67
Which of the following code blocks stores DataFrame itemsDf in executor memory and, if insufficient memory is available, serializes it and saves it to disk?

  • A. itemsDf.cache()
  • B. itemsDf.persist(StorageLevel.MEMORY_ONLY)
  • C. itemsDf.cache(StorageLevel.MEMORY_AND_DISK)
  • D. itemsDf.store()
  • E. itemsDf.write.option('destination', 'memory').save()

Answer: A

Explanation:
The key to solving this question is knowing (or reading in the documentation) that, by default, cache() stores values to memory and writes any partitions for which there is insufficient memory to disk. persist() can achieve the exact same behavior, however not with the StorageLevel.MEMORY_ONLY option listed here. It is also worth noting that cache() does not have any arguments.
If you have trouble finding the storage level information in the documentation, please also see this student Q&A thread that sheds some light here.
Static notebook | Dynamic notebook: See test 2

 

NEW QUESTION 68
Which of the following code blocks returns a DataFrame that is an inner join of DataFrame itemsDf and DataFrame transactionsDf, on columns itemId and productId, respectively and in which every itemId just appears once?

  • A. itemsDf.join(transactionsDf, itemsDf.itemId==transactionsDf.productId, how="inner").distinct(["itemId"])
  • B. itemsDf.join(transactionsDf, "itemsDf.itemId==transactionsDf.productId").distinct("itemId")
  • C. itemsDf.join(transactionsDf, itemsDf.itemId==transactionsDf.productId).dropDuplicates("itemId")
  • D. itemsDf.join(transactionsDf, "itemsDf.itemId==transactionsDf.productId", how="inner").dropDuplicates(["itemId"])
  • E. itemsDf.join(transactionsDf, itemsDf.itemId==transactionsDf.productId).dropDuplicates(["itemId"])

Answer: E

Explanation:
Filtering out distinct rows based on columns is achieved with the dropDuplicates method, not the distinct method which does not take any arguments.
The second argument of the join() method only accepts strings if they are column names. The SQL-like statement "itemsDf.itemId==transactionsDf.productId" is therefore invalid.
In addition, it is not necessary to specify how="inner", since the default join type for the join command is already inner.
More info: pyspark.sql.DataFrame.join - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
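
The two points above, a Column expression as the join condition and dropDuplicates with a list of column names, can be sketched as one helper. It assumes both arguments are PySpark DataFrames with the columns named in the question; the function name inner_join_unique_items is hypothetical.

```python
def inner_join_unique_items(items_df, transactions_df):
    """Inner-join the two DataFrames and keep one row per itemId.

    Both arguments are assumed to be pyspark.sql.DataFrame objects;
    the helper name is illustrative only.
    """
    joined = items_df.join(
        transactions_df,
        items_df.itemId == transactions_df.productId,  # a Column expression, not a string
        # no `how` argument: the join type defaults to inner
    )
    # dropDuplicates takes a list of column names; distinct() takes no arguments
    return joined.dropDuplicates(["itemId"])

# Usage, given the two DataFrames from the question:
# result = inner_join_unique_items(itemsDf, transactionsDf)
```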

 

NEW QUESTION 69
The code block shown below should add column transactionDateForm to DataFrame transactionsDf. The column should express the unix-format timestamps in column transactionDate as string type like Apr 26 (Sunday). Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__, from_unixtime(__3__, __4__))

  • A. 1. withColumn
    2. "transactionDateForm"
    3. "transactionDate"
    4. "MM d (EEE)"
  • B. 1. withColumnRenamed
    2. "transactionDate"
    3. "transactionDateForm"
    4. "MM d (EEE)"
  • C. 1. withColumn
    2. "transactionDateForm"
    3. "MMM d (EEEE)"
    4. "transactionDate"
  • D. 1. withColumn
    2. "transactionDateForm"
    3. "transactionDate"
    4. "MMM d (EEEE)"
  • E. 1. select
    2. "transactionDate"
    3. "transactionDateForm"
    4. "MMM d (EEEE)"

Answer: D

Explanation:
Correct code block:
transactionsDf.withColumn("transactionDateForm", from_unixtime("transactionDate", "MMM d (EEEE)"))
The question specifically asks about "adding" a column. In the context of all presented answers, DataFrame.withColumn() is the correct command for this. In theory, DataFrame.select() could also be used for this purpose, if all existing columns are selected and a new one is added.
DataFrame.withColumnRenamed() is not the appropriate command, since it can only rename existing columns, but cannot add a new column or change the value of a column.
Once DataFrame.withColumn() is chosen, you can read in the documentation (see below) that the first input argument to the method should be the column name of the new column.
The final difficulty is the date format. The question indicates that the date format Apr 26 (Sunday) is desired. The answers give "MMM d (EEEE)" and "MM d (EEE)" as options. It can be hard to know the details of the date format that is used in Spark. Specifically, knowing the differences between MMM and MM is probably not something you deal with every day. But, there is an easy way to remember the difference: M (one letter) is usually the shortest form: 4 for April. MM includes padding: 04 for April. MMM (three letters) is the three-letter month abbreviation: Apr for April. And MMMM is the longest possible form: April. Knowing this four-letter sequence helps you select the correct option here.
More info: pyspark.sql.DataFrame.withColumn - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
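
The four month forms can be checked with a small plain-Python model. This is not Spark's formatter: the helpers month_forms and format_mmm_d_eeee are hypothetical, they work in UTC only, and they rely on Python's default English month and day names.

```python
import calendar
from datetime import datetime, timezone

def month_forms(ts):
    """Return the renderings that the patterns M, MM, MMM and MMMM stand for."""
    d = datetime.fromtimestamp(ts, tz=timezone.utc)
    return {
        "M": str(d.month),                      # shortest form, no padding
        "MM": f"{d.month:02d}",                 # zero-padded number
        "MMM": calendar.month_abbr[d.month],    # three-letter abbreviation
        "MMMM": calendar.month_name[d.month],   # full month name
    }

def format_mmm_d_eeee(ts):
    """Mimic the pattern "MMM d (EEEE)" for a unix timestamp, evaluated in UTC."""
    d = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"{calendar.month_abbr[d.month]} {d.day} ({calendar.day_name[d.weekday()]})"

# Noon UTC on 26 April 2020, expressed as a unix timestamp
ts = int(datetime(2020, 4, 26, 12, tzinfo=timezone.utc).timestamp())
print(month_forms(ts))
print(format_mmm_d_eeee(ts))
```

For this timestamp the second call yields Apr 26 (Sunday), the shape the question asks for. In Spark the result also depends on the session time zone, which this model ignores.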

 

NEW QUESTION 70
......

Associate-Developer-Apache-Spark Test Engine files, Associate-Developer-Apache-Spark Dumps PDF: https://www.examsreviews.com/Associate-Developer-Apache-Spark-pass4sure-exam-review.html

Latest Databricks Associate-Developer-Apache-Spark PDF and Dumps (2023) Free Exam Questions Answers: https://drive.google.com/open?id=1ICnaqynoNjvPXIVSKwPpgV5RI5zDS--k