# Implement Parquet filter pushdown via new filter pushdown APIs #15769
```diff
@@ -54,6 +54,7 @@ mod tests {
     use datafusion_datasource::file_scan_config::FileScanConfigBuilder;
     use datafusion_datasource::source::DataSourceExec;

+    use datafusion_datasource::file::FileSource;
     use datafusion_datasource::{FileRange, PartitionedFile};
     use datafusion_datasource_parquet::source::ParquetSource;
     use datafusion_datasource_parquet::{
```
```diff
@@ -139,7 +140,7 @@ mod tests {
         self.round_trip(batches).await.batches
     }

-    fn build_file_source(&self, file_schema: SchemaRef) -> Arc<ParquetSource> {
+    fn build_file_source(&self, file_schema: SchemaRef) -> Arc<dyn FileSource> {
        // set up predicate (this is normally done by a layer higher up)
        let predicate = self
            .predicate
```
```diff
@@ -148,7 +149,7 @@ mod tests {

         let mut source = ParquetSource::default();
         if let Some(predicate) = predicate {
-            source = source.with_predicate(Arc::clone(&file_schema), predicate);
+            source = source.with_predicate(predicate);
         }

         if self.pushdown_predicate {
```

> **Review comment** (on lines 151 to 152): This seemed like an easy win since I was able to just change this so that the schema is always passed in by the …
```diff
@@ -161,14 +162,14 @@ mod tests {
             source = source.with_enable_page_index(true);
         }

-        Arc::new(source)
+        source.with_schema(Arc::clone(&file_schema))
     }

     fn build_parquet_exec(
         &self,
         file_schema: SchemaRef,
         file_group: FileGroup,
-        source: Arc<ParquetSource>,
+        source: Arc<dyn FileSource>,
     ) -> Arc<DataSourceExec> {
         let base_config = FileScanConfigBuilder::new(
             ObjectStoreUrl::local_filesystem(),
```
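Putting the hunks together: after this change the test helper sets the predicate without a schema and attaches the schema exactly once at the end. Below is a condensed, standalone sketch of that construction path (the test-harness plumbing is dropped, and the import paths for `SchemaRef` and `PhysicalExpr` are assumptions; `with_schema` returning `Arc<dyn FileSource>` is taken from the diff above):

```rust
use std::sync::Arc;

use arrow_schema::SchemaRef;
use datafusion_datasource::file::FileSource;
use datafusion_datasource_parquet::source::ParquetSource;
use datafusion_physical_expr::PhysicalExpr;

// Condensed from the diff above: the predicate no longer carries the file
// schema, and the schema is supplied exactly once via `with_schema`, which
// hands back a type-erased `Arc<dyn FileSource>`.
fn build_file_source(
    file_schema: SchemaRef,
    predicate: Option<Arc<dyn PhysicalExpr>>,
) -> Arc<dyn FileSource> {
    let mut source = ParquetSource::default();
    if let Some(predicate) = predicate {
        source = source.with_predicate(predicate); // schema argument is gone
    }
    source.with_schema(Arc::clone(&file_schema))
}
```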
In the parquet data source module, the public exports and the deprecated `ParquetExecBuilder` change accordingly:
```diff
@@ -59,7 +59,6 @@ pub use metrics::ParquetFileMetrics;
 pub use page_filter::PagePruningAccessPlanFilter;
 pub use reader::{DefaultParquetFileReaderFactory, ParquetFileReaderFactory};
 pub use row_filter::build_row_filter;
-pub use row_filter::can_expr_be_pushed_down_with_schemas;
 pub use row_group_filter::RowGroupAccessPlanFilter;
 use source::ParquetSource;
 pub use writer::plan_to_parquet;
```
```diff
@@ -223,8 +222,7 @@ impl ParquetExecBuilder {
         } = self;
         let mut parquet = ParquetSource::new(table_parquet_options);
         if let Some(predicate) = predicate.clone() {
-            parquet = parquet
-                .with_predicate(Arc::clone(&file_scan_config.file_schema), predicate);
+            parquet = parquet.with_predicate(predicate);
         }
         if let Some(metadata_size_hint) = metadata_size_hint {
             parquet = parquet.with_metadata_size_hint(metadata_size_hint)
```
```diff
@@ -244,7 +242,7 @@ impl ParquetExecBuilder {
             inner: DataSourceExec::new(Arc::new(base_config.clone())),
             base_config,
             predicate,
-            pruning_predicate: parquet.pruning_predicate,
+            pruning_predicate: None, // for backwards compat since `ParquetExec` is only for backwards compat anyway
             schema_adapter_factory: parquet.schema_adapter_factory,
             parquet_file_reader_factory: parquet.parquet_file_reader_factory,
             table_parquet_options: parquet.table_parquet_options,
```

> **Review comment:** Open to other suggestions (i.e. removing it). I felt like this minimizes breakage for folks still using …
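The intent of the `pruning_predicate: None` hunk is that code still on the deprecated `ParquetExec` keeps compiling unchanged. A hedged sketch of such a caller, assuming the pre-existing `ParquetExec::builder` entry point, with `file_scan_config` and `predicate` standing in for whatever the caller already had:

```rust
// Sketch only: the predicate is still forwarded to the underlying
// ParquetSource (now without a schema argument), while the exec's
// `pruning_predicate` field is hard-wired to `None`.
let exec = ParquetExec::builder(file_scan_config)
    .with_predicate(predicate)
    .build();
```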
**Review discussion**

> The point of this PR is that this moves filter pushdown from being something specialized that `ListingTable` does to something that works for any `TableProvider`; they don't need to do anything special! The compatibility checks also all happen within the parquet data source machinery, instead of leaking implementation details via `supports_filters_pushdown`.

> I have one question: aren't we expecting/preparing for people to use `ListingTable` when they read Parquet files? Are we eventually planning to remove all format-specific handling? Or is this the case only for filter pushdown?

> If that's the case, why don't we fully remove the `supports_filters_pushdown()` API altogether?

> I think many users of DataFusion (based on our usage, talks I've seen, and examples we have) use custom `TableProvider` implementations. I would keep `supports_filters_pushdown` so that `TableProvider`s can do `Exact` pruning of filters, e.g. using partition columns.

> We can justify implementing other `TableProvider`s for Parquet, but I still cannot understand why we need to degrade the capabilities of our `ListingTable`. Isn't it always better to prune/simplify things at the highest level possible?
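For context on that last exchange: returning `Exact` from `supports_filters_pushdown` tells DataFusion the provider fully applies the filter (for example via partition pruning), so no `FilterExec` needs to be kept above the scan. Below is a minimal sketch of the kind of classification a custom `TableProvider` might do; the `classify_filters` helper and its `column_refs`-based check are illustrative assumptions, not code from this PR:

```rust
use datafusion_expr::{Expr, TableProviderFilterPushDown};

// Illustrative sketch: a filter is `Exact` only when every column it
// references is a partition column, so partition pruning alone satisfies
// it; anything else stays `Inexact` and is re-evaluated above the scan.
fn classify_filters(
    filters: &[&Expr],
    partition_cols: &[&str],
) -> Vec<TableProviderFilterPushDown> {
    filters
        .iter()
        .map(|filter| {
            let only_partition_cols = filter
                .column_refs()
                .iter()
                .all(|col| partition_cols.contains(&col.name.as_str()));
            if only_partition_cols {
                TableProviderFilterPushDown::Exact
            } else {
                TableProviderFilterPushDown::Inexact
            }
        })
        .collect()
}
```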