
[SPARK-51834][SQL] Support end-to-end table constraint management #50631

Open · wants to merge 8 commits into master

Conversation

gengliangwang (Member)

What changes were proposed in this pull request?

Support end-to-end table constraint management:

  • Create a DSV2 table with constraints
  • Replace a DSV2 table with constraints
  • ALTER a DSV2 table to add a new constraint
  • ALTER a DSV2 table to drop a constraint
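
For illustration, a rough usage sketch of these operations; the catalog, namespace, table, and provider names (`cat`, `ns`, `items`, `foo`) are hypothetical and assume a DSV2 connector that supports constraints:

```scala
// Create a DSV2 table with a CHECK constraint.
spark.sql("""CREATE TABLE cat.ns.items (id INT, price DOUBLE,
  CONSTRAINT positive_price CHECK (price > 0)) USING foo""")

// Replace the table, again declaring a constraint.
spark.sql("""REPLACE TABLE cat.ns.items (id INT, price DOUBLE,
  CONSTRAINT positive_price CHECK (price > 0)) USING foo""")

// Add and drop a constraint on an existing table.
spark.sql("ALTER TABLE cat.ns.items ADD CONSTRAINT id_not_null CHECK (id IS NOT NULL)")
spark.sql("ALTER TABLE cat.ns.items DROP CONSTRAINT id_not_null")
```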

Why are the changes needed?

Allow users to define and modify table constraints in connectors that support them.

Does this PR introduce any user-facing change?

No, this is for the DSV2 framework.

How was this patch tested?

New unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

@gengliangwang (Member Author)

cc @aokolnychyi

val constraint = tableConstraint.toV2Constraint(isCreateTable = false)
val validatedTableVersion = table match {
  case t: ResolvedTable if constraint.enforced() =>
    t.table.currentVersion()
Member Author

Created a follow-up https://issues.apache.org/jira/browse/SPARK-51835 for testing the table version

"message" : [
"The check constraint `<checkCondition>` is non-deterministic. Check constraints must only contain deterministic expressions."
],
"sqlState" : "42621"
Contributor

The error code seems consistent with DB2 and what we use for generated columns, +1.
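
For context, a hypothetical statement that would hit this error class, since rand() is non-deterministic (catalog and table names are made up):

```scala
// Fails analysis: CHECK constraints must only contain deterministic expressions.
spark.sql("CREATE TABLE cat.ns.t (id INT, CONSTRAINT c CHECK (rand() > 0.5)) USING foo")
```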

@@ -18,11 +18,12 @@
package org.apache.spark.sql.catalyst.analysis

import org.apache.spark.SparkThrowable
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}
import org.apache.spark.sql.catalyst.expressions._
Contributor

What is the agreement in the community on wildcard imports? Are they permitted after a given number of elements are imported directly?

Member Author

As per https://github.com/databricks/scala-style-guide?tab=readme-ov-file#imports:
"Avoid using wildcard imports, unless you are importing more than 6 entities"

Some(LocalRelation(attributeList))
}

private def analyzeConstraints(
Contributor

Is there any other way to do this? Can we restructure the plan so that the analyzer naturally resolves these expressions? I like that we pivoted to DefaultValueExpression for default values, rather than using a custom analyzer.

Contributor

Let me think a bit.

Contributor

Can we do something similar to what @cloud-fan did for OverwriteByExpression in SPARK-33412?

Contributor

My worry is that we added DefaultValueExpression to eventually get rid of the custom analyzer and optimizer for default values. It would be great not to add more dependencies on it.

Member Author

There are some differences here. In a create table statement:

  • column default values and options CANNOT reference columns
  • constraints CAN reference columns

Using the default analyzer with a dummy relation is simple, and it can include all analysis batches other than the main Resolution batch.
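
A minimal sketch of this dummy-relation idea, assuming the new table's schema is available as a StructType; the helper name and signature here are hypothetical, not the PR's actual code:

```scala
import org.apache.spark.sql.catalyst.analysis.Analyzer
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LocalRelation, LogicalPlan}
import org.apache.spark.sql.types.StructType

// Resolve a CHECK expression against a dummy relation built from the (not yet
// created) table's columns, so the regular analyzer can resolve column references.
def analyzeCheckExpression(
    analyzer: Analyzer,
    schema: StructType,
    condition: Expression): LogicalPlan = {
  val attrs = schema.map(f => AttributeReference(f.name, f.dataType, f.nullable)())
  analyzer.execute(Filter(condition, LocalRelation(attrs)))
}
```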

val validateStatus = if (isCreateTable) {
  Constraint.ValidationStatus.UNVALIDATED
} else {
  Constraint.ValidationStatus.VALID
Contributor

Is the idea here that we always validate existing data in ALTER?

Member Author

Yes, for check constraints.

@@ -112,6 +117,27 @@ case class CheckConstraint(
with TableConstraint {
// scalastyle:on line.size.limit

def toV2Constraint(isCreateTable: Boolean): Constraint = {
Contributor

I wonder if the input param should be related to the validation status, rather than to whether it is create or alter. For instance, we can make validation optional in ALTER.

Member Author

OK, how about we make the validation status UNVALIDATED everywhere in this PR? Once we support enforcing check constraints, we can discuss this further.

Contributor

Makes sense to me.
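
So the direction agreed on above would reduce the snippet to something like this sketch (not the final code):

```scala
// Always start new constraints as UNVALIDATED until CHECK enforcement is supported.
val validateStatus = Constraint.ValidationStatus.UNVALIDATED
```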

  case _ =>
    null
}
Seq(TableChange.addConstraint(constraint, validatedTableVersion))
Contributor (@aokolnychyi, Apr 22, 2025)

CHECK constraints must optionally validate existing data in ALTER.
Am I right that this PR doesn't cover this? What would be our plan?

Member Author

"must optionally validate"

Makes sense. Do you mean CHECK ... NOT ENFORCED?

Contributor

ENFORCED/NOT ENFORCED impacts subsequent writes. I was referring to ALTER TABLE ... ADD CONSTRAINT that must scan the existing data.
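
Roughly, in standard SQL terms (the exact Spark syntax for these clauses is not confirmed by this excerpt):

```scala
// NOT ENFORCED only affects whether subsequent writes are checked.
spark.sql("ALTER TABLE cat.ns.t ADD CONSTRAINT c1 CHECK (price > 0) NOT ENFORCED")
// Adding a constraint that should be marked VALID would additionally require
// scanning the table's existing data.
spark.sql("ALTER TABLE cat.ns.t ADD CONSTRAINT c2 CHECK (price > 0) ENFORCED")
```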
