Databricks Databricks-Machine-Learning-Associate Free Certification Exam Questions and Answers (Dec 2025 Update)

Question # 1

A data scientist has defined a Pandas UDF function predict to parallelize the inference process for a single-node model:

They have written the following incomplete code block to use predict to score each record of Spark DataFrame spark_df:

Which of the following lines of code can be used to complete the code block to successfully complete the task?

predict(*spark_df.columns)

mapInPandas(predict)

predict(Iterator(spark_df))

mapInPandas(predict(spark_df.columns))

predict(spark_df.columns)

Question # 2

The implementation of linear regression in Spark ML first attempts to solve the linear regression problem using matrix decomposition, but this method does not scale well to large datasets with a large number of variables.

Which of the following approaches does Spark ML use to distribute the training of a linear regression model for large data?

Logistic regression

Singular value decomposition

Iterative optimization

Least-squares method

Question # 3

What is the name of the method that transforms categorical features into a series of binary indicator feature variables?

Leave-one-out encoding

Target encoding

One-hot encoding

Categorical

String indexing

Question # 4

Which of the following hyperparameter optimization methods automatically makes informed selections of hyperparameter values based on previous trials for each iterative model evaluation?

Random Search

Halving Random Search

Tree of Parzen Estimators

Grid Search

Question # 5

A data scientist wants to parallelize the training of trees in a gradient boosted tree to speed up the training process. A colleague suggests that parallelizing a boosted tree algorithm can be difficult.

Which of the following describes why?

Gradient boosting is not a linear algebra-based algorithm which is required for parallelization

Gradient boosting requires access to all data at once which cannot happen during parallelization.

Gradient boosting calculates gradients in evaluation metrics using all cores which prevents parallelization.

Gradient boosting is an iterative algorithm that requires information from the previous iteration to perform the next step.

Explanation:

Gradient boosting is fundamentally an iterative algorithm where each new tree is built based on the errors of the previous ones. This sequential dependency makes it difficult to parallelize the training of trees in gradient boosting, as each step relies on the results from the preceding step. Parallelization in this context would undermine the core methodology of the algorithm, which depends on sequentially improving the model's performance with each iteration.

References:

Machine Learning Algorithms (Challenges with Parallelizing Gradient Boosting).

Gradient boosting is an ensemble learning technique that builds models in a sequential manner. Each new model corrects the errors made by the previous ones. This sequential dependency means that each iteration requires the results of the previous iteration to make corrections. Here is a step-by-step explanation of why this makes parallelization challenging:

Sequential Nature: Gradient boosting builds one tree at a time. Each tree is trained to correct the residual errors of the previous trees. This requires the model to complete one iteration before starting the next.

Dependence on Previous Iterations: The gradient calculation at each step depends on the predictions made by the previous models. Therefore, the model must wait until the previous tree has been fully trained and evaluated before starting to train the next tree.

Difficulty in Parallelization: Because of this dependency, it is challenging to parallelize the training process. Unlike algorithms that process data independently in each step (e.g., random forests), gradient boosting cannot easily distribute the work across multiple processors or cores for simultaneous execution.

This iterative and dependent nature of the gradient boosting process makes it difficult to parallelize effectively.
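The dependency described above can be made concrete with a minimal sketch, assuming squared-error loss and one-split regression stumps on a single feature. The helper names (`fit_stump`, `predict_stump`, `boost`) and the toy data are hypothetical and exist only for illustration; this is not how Spark ML or any particular library implements boosting. The point to notice is that the residuals for each round are computed from the predictions left by the previous round, so the loop body cannot run for round k+1 until round k has finished.

```python
from typing import List, Tuple

Stump = Tuple[float, float, float]  # (threshold, left_value, right_value)


def fit_stump(xs: List[float], residuals: List[float]) -> Stump:
    """Fit a one-split regression stump to the residuals by least squares."""
    best: Stump = (xs[0], 0.0, 0.0)
    best_err = float("inf")
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_val = sum(left) / len(left)  # never empty: threshold is in xs
        right_val = sum(right) / len(right) if right else 0.0
        err = sum((r - left_val) ** 2 for r in left)
        err += sum((r - right_val) ** 2 for r in right)
        if err < best_err:
            best_err = err
            best = (threshold, left_val, right_val)
    return best


def predict_stump(stump: Stump, x: float) -> float:
    threshold, left_val, right_val = stump
    return left_val if x <= threshold else right_val


def boost(xs: List[float], ys: List[float], n_trees: int,
          learning_rate: float = 0.5):
    """Train stumps one after another; each round needs the previous preds."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps: List[Stump] = []
    losses: List[float] = []
    for _ in range(n_trees):
        # Residuals depend on predictions from ALL earlier rounds, which is
        # the sequential dependency that blocks tree-level parallelism.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, losses


# Hypothetical toy data: a step from roughly 1 to roughly 3.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
base, stumps, losses = boost(xs, ys, n_trees=5)
print(losses)  # mean squared error after each boosting round
```

By contrast, the work inside a single round (here, the search over candidate thresholds in `fit_stump`) has no such dependency, which is why practical implementations tend to parallelize within a tree rather than across trees.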

References

Gradient Boosting Machine Learning Algorithm

Understanding Gradient Boosting Machines

Question # 6

A data scientist has written a feature engineering notebook that utilizes the pandas library. As the size of the data processed by the notebook increases, the notebook's runtime is increasing drastically.

Which of the following tools can the data scientist use to spend the least amount of time refactoring their notebook to scale with big data?

PySpark DataFrame API

pandas API on Spark

Spark SQL

Feature Store

Question # 7

A team is developing guidelines on when to use various evaluation metrics for classification problems. The team needs to provide input on when to use the F1 score over accuracy.

Which of the following suggestions should the team include in their guidelines?

The F1 score should be utilized over accuracy when the number of actual positive cases is identical to the number of actual negative cases.

The F1 score should be utilized over accuracy when there are greater than two classes in the target variable.

The F1 score should be utilized over accuracy when there is significant imbalance between positive and negative classes and avoiding false negatives is a priority.

The F1 score should be utilized over accuracy when identifying true positives and true negatives are equally important to the business problem.
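As a small worked illustration of how the two metrics in this question are computed, the sketch below evaluates accuracy and F1 from confusion-matrix counts. The counts are hypothetical (1,000 records with 10 actual positives, scored by a model that predicts "negative" for everything) and are not taken from the exam material.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall (0.0 when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: the model never predicts the positive class.
tp, fp, fn, tn = 0, 0, 10, 990

acc = accuracy(tp, fp, fn, tn)
f1 = f1_score(tp, fp, fn)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")
```

Note that true negatives do not appear in the F1 formula at all, while they dominate the accuracy calculation in this example.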

Question # 8

A machine learning engineer would like to develop a linear regression model with Spark ML to predict the price of a hotel room. They are using the Spark DataFrame train_df to train the model.

The Spark DataFrame train_df has the following schema:

The machine learning engineer shares the following code block:

Which of the following changes does the machine learning engineer need to make to complete the task?

They need to call the transform method on train_df

They need to convert the features column to be a vector

They do not need to make any changes

They need to utilize a Pipeline to fit the model

They need to split the features column out into one column for each feature

Question # 9

Which of the following machine learning algorithms typically uses bagging?

Gradient boosted trees

K-means

Random forest

Linear regression

Decision tree

Question # 10

A machine learning engineer is trying to scale a machine learning pipeline by distributing its single-node model tuning process. After broadcasting the entire training data onto each core, each core in the cluster can train one model at a time. Because the tuning process is still running slowly, the engineer wants to increase the level of parallelism from 4 cores to 8 cores to speed up the tuning process. Unfortunately, the total memory in the cluster cannot be increased.

In which of the following scenarios will increasing the level of parallelism from 4 to 8 speed up the tuning process?

When the tuning process is randomized

When the entire data can fit on each core

When the model is unable to be parallelized

When the data is particularly long in shape

When the data is particularly wide in shape
