-
Notifications
You must be signed in to change notification settings - Fork 60
feat: Add range partitioning support #174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for creating this PR. I'll review this later. Before review this PR, I have a question. Could you check my comment?
README.md
Outdated
range_partitioning: | ||
field: customer_id | ||
range: | ||
start: '1' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this part use string
instead of integer
? As far as I know, range partition uses a number. And we can check the range is valid (start < end
) if we use integer
.
What do you think of this configuration layout?
(Do we need a range
block?).
range_partitioning:
field: customer_id # document uses `column` but this plugin uses `field`. [1]
start: 1
end: 1000
interval: 10
# [1] https://cloud.google.com/bigquery/docs/creating-partitioned-tables#create_an_integer-range_partitioned_table
(This is just my opinion, I want to ask co-maintainers this comment.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the comment! I followed the api documentation. If you prefer integer, I'll change this!
https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#RangePartitioning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a range block?
I fixed it with 49d2c36.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello, @kitagry. Thank you for waiting.
Could you use this layout? (Sorry, we decided to use the original design (except using integer
instead of string
in range values.)
range_partitioning:
field: customer_id
range:
start: 1 # integer not string.
end: 99999 # integer not string.
interval: 1 # integer not string.
and Could you check the range start
+ interval
< end
?
If you have any concern, please let me know.
I referenced the time_partition
configurations.
BigQuery API use
{
"type": string,
"expirationMs": string,
"field": string,
"requirePartitionFilter": boolean
}
embulk configuration
type: bigquery
table: table_name$20160929
time_partitioning:
type: DAY
expiration_ms: 259200000 # integer not strong, use sake case
We discussed this using this design document. (Written in Japanese)
After modification, I'll check the partition feature.
@kitagry Thank you for more work on this PR. I'll review this PR please wait. I'm not the original plugin developer. So, I need to investigate the configuration rule. If this plugin is based on the API settings, It would be better to use the original |
49d2c36
to
f8d039f
Compare
Hi @hiroyuki-sato , I changed range-partitioning fileds to be integer in f8d039f |
f8d039f
to
5398f69
Compare
@kitagry Thanks. I will test this PR later. Please wait for a while. |
@kitagry Thank you for waiting. I'm so sorry, but I can't get the review time. I'll review this in April. |
@hiroyuki-sato Hi. Could you review this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello, @kitagry
Sorry for the late review. I was able to confirm that the main part of this PR works. I was able to create a partition table using this dummy data.
I added the id
field using the following command.
I have some comments. Could you take a look?
nkf -w input.csv | awk -F, '{ print NR-1","$0 }' > curry.csv
in:
type: file
path_prefix: ./curry.csv
parser:
charset: UTF-8
newline: LF
type: csv
delimiter: ','
quote: '"'
escape: '"'
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: id, type: long}
- {name: name, type: string}
- {name: kana, type: string}
- {name: address, type: string}
- {name: sex, type: string}
- {name: age, type: long}
- {name: birthday, type: timestamp, format: '%Y/%m/%d'}
- {name: marriage, type: string}
- {name: prefecture, type: string}
- {name: mobile, type: string}
- {name: carrier, type: string}
- {name: curry, type: string}
out:
type: bigquery
mode: append
project: project
dataset: dataset
auth_method: service_account
json_keyfile: key.json
auto_create_table: true
table: part_test
location: asia-northeast1
range_partitioning:
field: id
range:
start: 1
end: 5001
interval: 500
Thank you for the review. I fixed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kitagry Thank you for your work. I left two comments. Could you tell me your opinion?
lib/embulk/output/bigquery.rb
Outdated
@@ -231,10 +232,43 @@ def self.configure(config, schema, task_count) | |||
unless task['time_partitioning']['type'] | |||
raise ConfigError.new "`time_partitioning` must have `type` key" | |||
end | |||
elsif Helper.has_partition_decorator?(task['table']) | |||
# If user specify range_partitioning, it should be used as is | |||
elsif Helper.has_partition_decorator?(task['table']) && task['range_partitioning'].nil? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate on this part? IIUC Helper.has_partition_decorator?(task['table'])
returns true if the table name contains $
. why you add task['range_partitioning'].nil?
?
I can't understand If user specify range_partitioning, it should be used as is
means.
In my opinion, these checks are preferable.
# can't use two partitions config at the same time.
if task['time_partitioning'] && task['range_partitioning']
raise ConfigError.new ...
end
# partition decrator doesn't support range_partition (if needed)
if Helper.has_partition_decorator?(task['table']) && task['range_partitioning']
raise ConfigError.new ...
end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review. I think that the following code would be preferable.
if Helper.has_partition_decorator?(task['table'])
task['time_partitioning'] = {'type' => 'DAY'}
end
if task['time_partitioning'] && task['range_partitioning']
raise ConfigError.new ...
end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, the error message would be confusing. I fix like e9cc945. Could you check this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This is just a double check.) the error message
means If user specify range_partitioning, it should be used as is
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I use like following. Users who use partition decorator may be confused, because he don't use time_partitioning.
if Helper.has_partition_decorator?(task['table'])
task['time_partitioning'] = {'type' => 'DAY'}
end
if task['time_partitioning'] && task['range_partitioning']
raise ConfigError.new "`time_partitioning` and `range_partitioning` cannot be used at the same time"
end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, you mean if the user uses the following configuration, the user may confuse the storage message. correct?
If so, your proposed PR seems good.
table: part_test$200105
range_partitioning:
field: id
range:
start: 1
end: 5001
interval: 500
1e0f108
to
e9cc945
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kitagry Thank you for fixing the PR. Basically, LGTM. Could you check two minor comments?
@kitagry Thank you for your work. LGTM. I'll re-check this change and request co-maintainer review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kitagry, could you fix the test and check my comments? I'll approve this PR after fixing the test. (I'll request a check co-maintainer)
Thank you for the comment. I'll see tomorrow 🙏 |
Co-authored-by: Hiroyuki Sato <hiroysato@gmail.com>
de9c8be
to
068acba
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kitagry LGTM👍
@joker1007 Could you take a look this PR?
This pull request introduces support for range partitioning in BigQuery.
I checked this feature with
example/config_replace_field_range_partitioned_table.yml