
[SPARK-51834][SQL] Support end-to-end table constraint management #50631

Open · wants to merge 8 commits into master

Conversation

gengliangwang (Member)

What changes were proposed in this pull request?

Support end-to-end table constraint management:

  • Create a DSV2 table with constraints
  • Replace a DSV2 table with constraints
  • ALTER a DSV2 table to add a new constraint
  • ALTER a DSV2 table to drop a constraint
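
For illustration, a rough usage sketch of these operations; the catalog, namespace, table, and provider names (`cat`, `ns`, `items`, `foo`) are hypothetical and assume a DSV2 connector that supports constraints:

```scala
// Create a DSV2 table with a CHECK constraint.
spark.sql("""CREATE TABLE cat.ns.items (id INT, price DOUBLE,
  CONSTRAINT positive_price CHECK (price > 0)) USING foo""")

// Replace the table, again declaring a constraint.
spark.sql("""REPLACE TABLE cat.ns.items (id INT, price DOUBLE,
  CONSTRAINT positive_price CHECK (price > 0)) USING foo""")

// Add and drop a constraint on an existing table.
spark.sql("ALTER TABLE cat.ns.items ADD CONSTRAINT id_not_null CHECK (id IS NOT NULL)")
spark.sql("ALTER TABLE cat.ns.items DROP CONSTRAINT id_not_null")
```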

Why are the changes needed?

Allow users to define and modify table constraints in connectors that support them.

Does this PR introduce any user-facing change?

No, this is for the DSV2 framework.

How was this patch tested?

New unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

@gengliangwang (Member Author)

cc @aokolnychyi

val constraint = tableConstraint.toV2Constraint(isCreateTable = false)
val validatedTableVersion = table match {
  case t: ResolvedTable if constraint.enforced() =>
    t.table.currentVersion()
Member Author

Created a follow-up https://issues.apache.org/jira/browse/SPARK-51835 for testing the table version

"message" : [
"The check constraint `<checkCondition>` is non-deterministic. Check constraints must only contain deterministic expressions."
],
"sqlState" : "42621"
Contributor

The error code seems consistent with DB2 and what we use for generated columns, +1.
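
For context, a hypothetical statement that would hit this error class, since rand() is non-deterministic (catalog and table names are made up):

```scala
// Fails analysis: CHECK constraints must only contain deterministic expressions.
spark.sql("CREATE TABLE cat.ns.t (id INT, CONSTRAINT c CHECK (rand() > 0.5)) USING foo")
```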

@@ -18,11 +18,12 @@
package org.apache.spark.sql.catalyst.analysis

import org.apache.spark.SparkThrowable
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}
import org.apache.spark.sql.catalyst.expressions._
Contributor

What is the agreement in the community on wildcard imports? Are they permitted after a given number of elements are imported directly?

Member Author

As per https://github.com/databricks/scala-style-guide?tab=readme-ov-file#imports:
"Avoid using wildcard imports, unless you are importing more than 6 entities"

Some(LocalRelation(attributeList))
}

private def analyzeConstraints(
Contributor

Is there any other way to do this? Can we restructure the plan so that the analyzer naturally resolves these expressions? I like that we pivoted to DefaultValueExpression for default values, rather than using a custom analyzer.

Contributor

Let me think a bit.

Contributor

Can we do something similar to what @cloud-fan did for OverwriteByExpression in SPARK-33412?

Contributor

My worry is that we added DefaultValueExpression to eventually get rid of the custom analyzer and optimizer for default values. It would be great not to add more dependencies on it.

Member Author

There are some differences here. In a create table statement:

  • column default values and options CANNOT reference columns
  • constraints CAN reference columns

Using the default analyzer with a dummy relation is simple, and it can include all analysis batches other than the main Resolution batch.
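
A minimal sketch of this dummy-relation idea, assuming the new table's schema is available as a StructType; the helper name and signature here are hypothetical, not the PR's actual code:

```scala
import org.apache.spark.sql.catalyst.analysis.Analyzer
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LocalRelation, LogicalPlan}
import org.apache.spark.sql.types.StructType

// Resolve a CHECK expression against a dummy relation built from the (not yet
// created) table's columns, so the regular analyzer can resolve column references.
def analyzeCheckExpression(
    analyzer: Analyzer,
    schema: StructType,
    condition: Expression): LogicalPlan = {
  val attrs = schema.map(f => AttributeReference(f.name, f.dataType, f.nullable)())
  analyzer.execute(Filter(condition, LocalRelation(attrs)))
}
```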

val validateStatus = if (isCreateTable) {
  Constraint.ValidationStatus.UNVALIDATED
} else {
  Constraint.ValidationStatus.VALID
Contributor

Is the idea here that we always validate existing data in ALTER?

Member Author

Yes, for check constraints.

@@ -112,6 +117,27 @@ case class CheckConstraint(
with TableConstraint {
// scalastyle:on line.size.limit

def toV2Constraint(isCreateTable: Boolean): Constraint = {
Contributor

I wonder if the input param should be related to the validation status, rather than to whether it is create or alter. For instance, we can make validation optional in ALTER.

Member Author

OK, how about we make the validation status UNVALIDATED everywhere in this PR? Once we support enforcing check constraints, we can discuss this further.

Contributor

Makes sense to me.
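
So the direction agreed on above would reduce the snippet to something like this sketch (not the final code):

```scala
// Always start new constraints as UNVALIDATED until CHECK enforcement is supported.
val validateStatus = Constraint.ValidationStatus.UNVALIDATED
```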

  case _ =>
    null
}
Seq(TableChange.addConstraint(constraint, validatedTableVersion))
Contributor (@aokolnychyi, Apr 22, 2025)

CHECK constraints must optionally validate existing data in ALTER.
Am I right that this PR doesn't cover this? What would be our plan?

Member Author

"must optionally validate"

Makes sense. Do you mean CHECK ... NOT ENFORCED?

Contributor

ENFORCED/NOT ENFORCED impacts subsequent writes. I was referring to ALTER TABLE ... ADD CONSTRAINT that must scan the existing data.
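
Roughly, in standard SQL terms (the exact Spark syntax for these clauses is not confirmed by this excerpt):

```scala
// NOT ENFORCED only affects whether subsequent writes are checked.
spark.sql("ALTER TABLE cat.ns.t ADD CONSTRAINT c1 CHECK (price > 0) NOT ENFORCED")
// Adding a constraint that should be marked VALID would additionally require
// scanning the table's existing data.
spark.sql("ALTER TABLE cat.ns.t ADD CONSTRAINT c2 CHECK (price > 0) ENFORCED")
```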
