Skip to content

feat: Add range partitioning support #174

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

kitagry
Copy link

@kitagry kitagry commented Feb 15, 2025

This pull request introduces support for range partitioning in BigQuery.

I checked this feature with example/config_replace_field_range_partitioned_table.yml

image

@hiroyuki-sato hiroyuki-sato self-requested a review February 15, 2025 15:47
Copy link
Member

@hiroyuki-sato hiroyuki-sato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for creating this PR. I'll review this later. Before review this PR, I have a question. Could you check my comment?

README.md Outdated
range_partitioning:
field: customer_id
range:
start: '1'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this part use string instead of integer? As far as I know, range partition uses a number. And we can check the range is valid (start < end) if we use integer.

What do you think of this configuration layout?
(Do we need a range block?).

range_partitioning:
  field:  customer_id # document uses `column` but this plugin uses `field`. [1] 
  start: 1
  end: 1000
  interval: 10
  # [1] https://cloud.google.com/bigquery/docs/creating-partitioned-tables#create_an_integer-range_partitioned_table

(This is just my opinion, I want to ask co-maintainers this comment.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the comment! I followed the api documentation. If you prefer integer, I'll change this!

https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#RangePartitioning

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a range block?

I fixed it with 49d2c36.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello, @kitagry. Thank you for waiting.

Could you use this layout? (Sorry, we decided to use the original design (except using integer instead of string in range values.)

range_partitioning:
  field: customer_id
  range:
    start: 1    # integer not string.
    end: 99999  # integer not string.
    interval: 1 # integer not string.

and Could you check the range start + interval < end?
If you have any concern, please let me know.

I referenced the time_partition configurations.

BigQuery API use

{
  "type": string,
  "expirationMs": string,
  "field": string,
  "requirePartitionFilter": boolean
}

embulk configuration

  type: bigquery
  table: table_name$20160929
  time_partitioning:
    type: DAY
    expiration_ms: 259200000 # integer not strong, use sake case

We discussed this using this design document. (Written in Japanese)

After modification, I'll check the partition feature.

@hiroyuki-sato
Copy link
Member

@kitagry Thank you for more work on this PR. I'll review this PR please wait. I'm not the original plugin developer. So, I need to investigate the configuration rule. If this plugin is based on the API settings, It would be better to use the original field.range.start. I'll talk co-maintainer about this.

@kitagry kitagry force-pushed the add-range-paritioned branch from 49d2c36 to f8d039f Compare March 6, 2025 12:54
@kitagry
Copy link
Author

kitagry commented Mar 6, 2025

Hi @hiroyuki-sato , I changed range-partitioning fileds to be integer in f8d039f

@kitagry kitagry force-pushed the add-range-paritioned branch from f8d039f to 5398f69 Compare March 9, 2025 06:26
@hiroyuki-sato
Copy link
Member

@kitagry Thanks. I will test this PR later. Please wait for a while.

@hiroyuki-sato
Copy link
Member

@kitagry Thank you for waiting. I'm so sorry, but I can't get the review time. I'll review this in April.

@kitagry
Copy link
Author

kitagry commented Apr 22, 2025

@hiroyuki-sato Hi. Could you review this?

Copy link
Member

@hiroyuki-sato hiroyuki-sato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello, @kitagry

Sorry for the late review. I was able to confirm that the main part of this PR works. I was able to create a partition table using this dummy data.
I added the id field using the following command.
I have some comments. Could you take a look?

nkf -w input.csv | awk -F, '{ print NR-1","$0 }' > curry.csv

in:
  type: file
  path_prefix: ./curry.csv
  parser:
    charset: UTF-8
    newline: LF
    type: csv
    delimiter: ','
    quote: '"'
    escape: '"'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: long}
    - {name: name, type: string}
    - {name: kana, type: string}
    - {name: address, type: string}
    - {name: sex, type: string}
    - {name: age, type: long}
    - {name: birthday, type: timestamp, format: '%Y/%m/%d'}
    - {name: marriage, type: string}
    - {name: prefecture, type: string}
    - {name: mobile, type: string}
    - {name: carrier, type: string}
    - {name: curry, type: string}
out:
  type: bigquery
  mode: append
  project: project
  dataset: dataset
  auth_method: service_account
  json_keyfile: key.json
  auto_create_table: true
  table: part_test
  location: asia-northeast1
  range_partitioning:
    field: id
    range:
      start: 1
      end: 5001
      interval: 500

@kitagry
Copy link
Author

kitagry commented Apr 24, 2025

Thank you for the review. I fixed.

Copy link
Member

@hiroyuki-sato hiroyuki-sato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kitagry Thank you for your work. I left two comments. Could you tell me your opinion?

@@ -231,10 +232,43 @@ def self.configure(config, schema, task_count)
unless task['time_partitioning']['type']
raise ConfigError.new "`time_partitioning` must have `type` key"
end
elsif Helper.has_partition_decorator?(task['table'])
# If user specify range_partitioning, it should be used as is
elsif Helper.has_partition_decorator?(task['table']) && task['range_partitioning'].nil?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate on this part? IIUC Helper.has_partition_decorator?(task['table']) returns true if the table name contains $. why you add task['range_partitioning'].nil??

I can't understand If user specify range_partitioning, it should be used as is means.

In my opinion, these checks are preferable.

# can't use two partitions config at the same time.
if task['time_partitioning'] && task['range_partitioning']
  raise ConfigError.new ...
end

# partition decrator doesn't support range_partition (if needed)
if Helper.has_partition_decorator?(task['table']) && task['range_partitioning']
  raise ConfigError.new ...
end

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the review. I think that the following code would be preferable.

if Helper.has_partition_decorator?(task['table'])
  task['time_partitioning'] = {'type' => 'DAY'}
end

if task['time_partitioning'] && task['range_partitioning']
  raise ConfigError.new ...
end

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the error message would be confusing. I fix like e9cc945. Could you check this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This is just a double check.) the error message means If user specify range_partitioning, it should be used as is?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I use like following. Users who use partition decorator may be confused, because he don't use time_partitioning.

if Helper.has_partition_decorator?(task['table'])
  task['time_partitioning'] = {'type' => 'DAY'}
end

if task['time_partitioning'] && task['range_partitioning']
  raise ConfigError.new "`time_partitioning` and `range_partitioning` cannot be used at the same time"
end

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you mean if the user uses the following configuration, the user may confuse the storage message. correct?
If so, your proposed PR seems good.

  table: part_test$200105
  range_partitioning:
    field: id
    range:
      start: 1
      end: 5001
      interval: 500

@kitagry kitagry force-pushed the add-range-paritioned branch from 1e0f108 to e9cc945 Compare April 26, 2025 05:00
Copy link
Member

@hiroyuki-sato hiroyuki-sato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kitagry Thank you for fixing the PR. Basically, LGTM. Could you check two minor comments?

@hiroyuki-sato
Copy link
Member

@kitagry Thank you for your work. LGTM. I'll re-check this change and request co-maintainer review.

Copy link
Member

@hiroyuki-sato hiroyuki-sato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kitagry, could you fix the test and check my comments? I'll approve this PR after fixing the test. (I'll request a check co-maintainer)

@kitagry
Copy link
Author

kitagry commented Apr 30, 2025

Thank you for the comment. I'll see tomorrow 🙏

Co-authored-by: Hiroyuki Sato <hiroysato@gmail.com>
@kitagry kitagry force-pushed the add-range-paritioned branch from de9c8be to 068acba Compare May 1, 2025 14:21
Copy link
Member

@hiroyuki-sato hiroyuki-sato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kitagry LGTM👍

@joker1007 Could you take a look this PR?

@hiroyuki-sato hiroyuki-sato requested a review from joker1007 May 2, 2025 00:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants