
Commit eed0e84

Added DuckDB to Parquet lesson.
1 parent 7fd0a80 commit eed0e84

File tree

3 files changed (+163 -23 lines)


images/duckdb-logo.png (6.58 KB)
images/parquet-logo.png (21.5 KB)
sections/parquet-arrow.qmd (+163 -23)

---
title: "Parquet, DuckDB, and Arrow"
---

## Learning Objectives

- The difference between column major and row major data
- Speed advantages of columnar data storage
- Parquet as a highly efficient columnar data store
- Using DuckDB as an in-process query engine
- How `arrow` enables faster processing

## Introduction

Parallelization is great, and can greatly help you in working with large data. However, it might not help you with every processing problem. As we discussed with Dask, sometimes your data are too large to be read into memory, or you have I/O limitations. For multidimensional array data, `xarray` and `Dask` are pretty amazing. But what about tabular data? Some types of data are not well represented as arrays, and some arrays would not be efficiently represented as tables (see [Abernathy's 2025 article on tensors versus tables](https://earthmover.io/blog/tensors-vs-tables) for a well-thought-out perspective on this).

Parquet, DuckDB, and Arrow are powerful tools that are designed to help overcome some of these scaling problems with tabular data. They are newer technologies and are advancing quickly, but there is a lot of excitement about the possibilities these tools provide.

::: {.callout-note}

### POSIX file handling

Before jumping into those tools, however, let's first discuss system calls. These are requests that a program makes to the operating system, which carries them out on the program's behalf. Several are relevant to reading and writing data: open, read, write, seek, and close. Open establishes a connection with a file for reading, writing, or both. On open, a file offset points to the beginning of the file. After reading or writing `n` bytes, the offset moves `n` bytes forward to prepare for the next operation. Read copies data from the file into a memory buffer, and write copies data from a memory buffer to the file. Seek changes the location of the offset pointer, for either reading or writing purposes. Finally, close closes the connection to the file.

:::
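
To make those calls concrete, Python's `os` module exposes them almost directly. Here is a minimal sketch (the filename is just a placeholder):

```{python}
#| eval: false
import os

# open() returns a file descriptor; the offset starts at byte 0
fd = os.open("example.csv", os.O_RDONLY)

header = os.read(fd, 100)         # read() copies 100 bytes into memory and advances the offset
os.lseek(fd, 0, os.SEEK_SET)      # seek() moves the offset back to the beginning
first_chunk = os.read(fd, 1024)   # read again, this time a larger chunk

os.close(fd)                      # close() releases the connection to the file
```
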
If you've worked with even moderately sized datasets, you may have encountered an "out of memory" error. Memory is where a computer stores the information needed immediately for running processes. This is in contrast to storage, which is typically slower to access than memory but has a much larger capacity. When you `open` a file, you are establishing a connection between your processor and the information in storage. On `read`, the data are read into memory, where they are then available to your Python process, for example.

So what happens if the data you need to read in are larger than your memory? My brand new M1 MacBook Pro has 16 GB of memory, but that would be considered a modestly sized dataset by this course's standards. There are a number of solutions to this problem that don't involve just buying a computer with more memory. In this lesson we'll discuss the difference between row-major and column-major file formats, and how leveraging column-major formats can increase memory efficiency. We'll also learn about other Python packages like `duckdb` and `pyarrow`; the latter has a memory format that allows for "zero copy" reads.

## Row major vs column major

The difference between row major and column major is in the ordering of items in the data when they are read into memory.

::: {.column-margin}

![Image credit: Wikipedia](https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Row_and_column_major_order.svg/500px-Row_and_column_major_order.svg.png)

:::

Take the array:

```
[[11, 12, 13],
 [21, 22, 23]]
```

In **row-major order**, we would store the items in memory with the first row's elements in sequence, followed by the second row's elements:

`11, 12, 13, 21, 22, 23`

The same data stored in **column-major order** would place elements from the same column close together:

`11, 21, 12, 22, 13, 23`

Because of this difference in the ordering of the stored bytes, accessing the data from, say, the second column is more efficient if the data are in column-major order: the bytes storing the second column's values are contiguous and can be read without touching any data from columns 1 and 3. Reading many complete rows, however, would be inefficient in column-major order, so your expected access patterns will determine which layout performs better.

It turns out that, for many analytical purposes, we only need a small subset of the data from one or two columns to perform a computation, so column-major layouts are often the more efficient choice.

By default, C and SAS use row-major order for arrays, while Fortran, MATLAB, R, and Julia use column-major order.

Python's `numpy` package uses row-major order by default and can be configured to use column-major order.
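
As a quick illustration (a minimal sketch, not part of the original lesson code), `numpy` lets you request either layout explicitly and inspect the resulting memory order:

```{python}
#| eval: false
import numpy as np

a = np.array([[11, 12, 13], [21, 22, 23]])  # row-major ('C' order) by default
f = np.asfortranarray(a)                    # same values, column-major ('F' order)

print(a.flags['C_CONTIGUOUS'], f.flags['F_CONTIGUOUS'])  # True True
print(a.ravel(order='K'))  # memory order of a: 11 12 13 21 22 23
print(f.ravel(order='K'))  # memory order of f: 11 21 12 22 13 23
```
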
### Row major versus column major files

The same concept applies to file formats as to the in-memory arrays above. In row-major file formats, the values (bytes) of each record are stored sequentially.

Name | Location | Age
-----|----------|----
John | Washington | 40
Mariah | Texas | 21
Allison | Oregon | 57

In the above row-major example, data are read in the order:
`John, Washington, 40\nMariah, Texas, 21\n`.

This means that getting a subset of rows with all the columns would be easy; you can specify reading only the first X rows (utilizing the seek system call). However, if we are only interested in Name and Location, we would still have to read every complete row and then discard the Age values.

In a column-major file format, the same table would instead be stored column by column, and the read order would first be the names, then the locations, then the age. That makes it cheap to read just the Name and Location columns while skipping the Age values entirely.

## Parquet

::: {layout-ncol="2"}

Parquet is an open-source binary file format that stores data in a column-major format. The format contains several key components:

![](../images/parquet-logo.png)

:::

- row group
- column
- page
- footer

![](../images/parquet-schematic.png){.lightbox fig-align="center" width="80%"}

Row groups are blocks of data containing a set number of rows with data from the same columns. Within each row group, data are organized in column-major format, and within each column are pages that are typically a fixed size. The footer of the file contains metadata like the schema, encodings, and unique values in each column, which makes scanning the metadata very efficient.

The parquet format has many tricks to increase storage efficiency, and it is increasingly being used to handle large datasets.
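
To see these pieces for yourself, `pyarrow.parquet` can open just the footer of a file without reading the data pages; a minimal sketch (the filename is a placeholder):

```{python}
#| eval: false
import pyarrow.parquet as pq

pf = pq.ParquetFile("raster_summary.parquet")   # reads only the footer metadata
print(pf.metadata.num_rows, pf.metadata.num_row_groups)
print(pf.schema_arrow)                          # column names and types
print(pf.metadata.row_group(0).column(0).statistics)  # min/max stats for one column chunk
```
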
## Fast access with DuckDB

For the Witharana et al. ice wedge polygon (IWP) dataset ([doi:10.18739/A24F1MK7Q](https://doi.org/10.18739/A24F1MK7Q)), we created a tabular data file of statistics that is 4.6 GB in text CSV format, but can be reduced to 1.6 GB by a straight conversion to Parquet:

```
jones@arcticdata.io:iwp_geotiff_low_medium$ du -sh raster_summary.*
4.6G    raster_summary.csv
1.6G    raster_summary.parquet
```

::: {.column-margin}

![DuckDB](../images/duckdb-logo.png)

:::

But even at 1.6 GB, that will take a while to download -- how might we know whether we want to? Easy: use [DuckDB](https://duckdb.org/), an in-process analytical database that can efficiently read and query columnar data formats like Parquet, and much more!

Because DuckDB can read the metadata in a Parquet file and take advantage of Parquet's efficient data layout, we can quickly query even a large, remote dataset without downloading the whole thing. First, let's take a look at the columns in the dataset (a metadata-only query):

```{python}
#| eval: false
import duckdb
iwp_path = 'https://arcticdata.io/data/10.18739/A24F1MK7Q/iwp_geotiff_low_medium/raster_summary.parquet'
iwp = duckdb.read_parquet(iwp_path)
print(iwp.columns)
```
```
['stat', 'bounds', 'min', 'max', 'mean', 'median', 'std', 'var', 'sum', 'path', 'tile', 'z']
```

While that dataset is quite large, our metadata query returned in less than a second. Other summary queries that can be answered from metadata alone can be exceedingly fast as well. For example, let's count all of the rows, which we can do using SQL syntax:

```{python}
#| eval: false
duckdb.sql("SELECT count(*) as n FROM iwp;").show()
```
```
┌──────────┐
│    n     │
│  int64   │
├──────────┤
│ 18150329 │
└──────────┘
```

So we quickly learn that this table has 18 million rows!

A common operation that can be expensive is finding the distinct values in a column, which, in a row-major data source, would often require reading the entire table. Let's try with DuckDB:

```{python}
#| eval: false
duckdb.sql("select distinct stat from iwp order by stat;").show()
```
```
┌──────────────┐
│     stat     │
│   varchar    │
├──────────────┤
│ iwp_coverage │
└──────────────┘
```

We get an almost immediate return because DuckDB can 1) look only at the `stat` column, ignoring the rest of the data, and 2) take advantage of metadata and indexes on that column to avoid reading the whole column anyway.
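
If you are curious what that footer metadata looks like, DuckDB can display it directly with its `parquet_metadata()` table function; a quick, illustrative query against the same remote file (the selected column names follow DuckDB's documented output for that function):

```{python}
#| eval: false
# per-column, per-row-group statistics stored in the Parquet footer
duckdb.sql(f"""
    SELECT row_group_id, path_in_schema, num_values, stats_min_value, stats_max_value
    FROM parquet_metadata('{iwp_path}')
    LIMIT 5;
""").show()
```
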

Finally, let's look at a query of the actual data. If we select just two of the columns and filter the rows, we can run an ad hoc query that returns a slice of the massive table very quickly.

```{python}
#| eval: false
low_coverage = iwp.project("bounds, sum").filter("sum < 10")
low_coverage.count('*')
```
```
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│        37169 │
└──────────────┘
```

We can also easily save our small 1.4 MB slice of the much larger table locally as a parquet file using `write_parquet`.

```{python}
#| eval: false
import os

low_coverage.write_parquet("low_coverage.parquet")
os.stat("low_coverage.parquet").st_size / (1024 * 1024)
```
```
1.4038171768188477
```

Amazingly, DuckDB and parquet handle all of this high-performance access without any server-side services running. Typically, remote data access would be provided through a server-side service like a Postgres database or some other heavyweight process. But in this case, all we have is a file on disk being served up by a standard web server, and all of the querying is done completely client-side, and quickly, because of the beauty of the parquet file format.

Of course, you can also download the whole parquet file and access it locally through duckdb as well!

Connecting back to our earlier discussions of parallel processing, one could imagine dropping a large parquet dataset on a server and then spinning up a distributed, parallel processing model in which each of the distributed processes uses DuckDB to reach out and grab just the small chunk of data it needs to process and compute. Fast, lightweight, and scalable computing, with almost zero infrastructure!
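
As a rough sketch of that pattern (not from the lesson itself; the worker function, zoom values, and SQL are illustrative assumptions), each worker opens its own DuckDB connection and pulls only its slice of the remote Parquet file:

```{python}
#| eval: false
from concurrent.futures import ProcessPoolExecutor
import duckdb

def mean_coverage_for_zoom(path, z):
    # each worker uses its own in-process DuckDB and reads only the bytes it needs
    con = duckdb.connect()
    return con.sql(f"""
        SELECT z, avg("sum") AS mean_sum
        FROM read_parquet('{path}')
        WHERE z = {z}
        GROUP BY z;
    """).fetchall()

with ProcessPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(mean_coverage_for_zoom, iwp_path, z) for z in [9, 10, 11, 12]]
    results = [f.result() for f in futures]
```
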

## Arrow

So far, we have discussed the difference between organizing information in row-major and column-major format, how that applies to arrays, and how it applies to data storage on disk using Parquet.

Let's say that you have utilized the Parquet data format for more efficient storage of your data on disk. To analyze those data, you still need an efficient representation in memory. That is where Arrow comes in: it defines a standardized, column-major, in-memory format that different tools can share with little or no copying, and `pyarrow` is its Python implementation.

`pyarrow` is great, but relatively low level. It supports basic group by and aggregate functions, as well as table and dataset joins, but it does not support the full set of operations that `pandas` does.
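
To give a feel for that lower-level API, here is a minimal sketch (with a small made-up table, not part of the lesson data) of a group by and aggregate in pure `pyarrow`:

```{python}
#| eval: false
import pyarrow as pa

counts = pa.table({
    "Taxa": ["smelt", "smelt", "salmon"],
    "Count": [3, 5, 2],
})

# group_by/aggregate work directly on Arrow tables, no pandas required
counts.group_by("Taxa").aggregate([("Count", "sum")]).to_pandas()
```
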

## Delta Fisheries using Arrow

In this example, we'll read in a dataset of fish abundance in the San Francisco Estuary, which is published in csv format on the [Environmental Data Initiative](https://portal.edirepository.org/nis/mapbrowse?scope=edi&identifier=1075&revision=1). This dataset isn't huge, but it is big enough (3 GB) that working with it locally can be fairly taxing on memory. Motivated by user difficulties in actually working with the data, the [`deltafish`](https://github.com/Delta-Stewardship-Council/deltafish) R package was written using the R implementation of `arrow`. It works by downloading the EDI repository data, writing it to a local cache in parquet format, and using `arrow` to query it. In this example, I've put the Parquet files in a sharable location so we can explore them using `pyarrow`.

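
A minimal sketch of opening the fish dataset with `pyarrow.dataset`, mirroring the survey example below (the path and options are assumptions):

```{python}
#| eval: false
import pyarrow.dataset as ds

# open the partitioned fish table as a FileSystemDataset; nothing is read into memory yet
deltafish = ds.dataset("/home/shares/deltafish/fish",
                       format="parquet",
                       partitioning="hive")
```
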
Listing the dataset's files with `deltafish.files` shows one Parquet file per `Taxa` partition:

```
['/home/shares/deltafish/fish/Taxa=Acanthogobius flavimanus/part-0.parquet',
 '/home/shares/deltafish/fish/Taxa=Acipenser medirostris/part-0.parquet',
 '/home/shares/deltafish/fish/Taxa=Acipenser transmontanus/part-0.parquet',
 '/home/shares/deltafish/fish/Taxa=Acipenser/part-0.parquet'...]
```

You can view the columns of a dataset using `schema.to_string()`
131254

132255
```{python}
133256
#| eval: false
134-
deltafish.schema.to_string()
257+
deltafish.schema
135258
```
136259

137260
```
@@ -161,7 +284,7 @@ First read in the survey dataset.
161284

162285
```{python}
163286
#| eval: false
164-
survey = ds.dataset("/home/jclark/deltafish/survey",
287+
survey = ds.dataset("/home/shares/deltafish/survey",
165288
format="parquet",
166289
partitioning='hive')
167290
```

Take a look at the columns again:

```{python}
#| eval: false
survey.schema
```

Let's pick out only the ones we are interested in.
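
A sketch of what that selection could look like, with hypothetical column names (only `SampleID`, the join key, and `Taxa`, the partition, appear elsewhere in this lesson):

```{python}
#| eval: false
# read only the needed columns into in-memory Tables (column names are placeholders)
survey_s = survey.to_table(columns=["SampleID", "Date", "Station"])
fishf = deltafish.to_table(columns=["SampleID", "Taxa", "Count"])
```
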

Then do the join, and convert to a pandas `data.frame`.

```{python}
#| eval: false
fish_j = fishf.join(survey_s, "SampleID").to_pandas()
fish_j.head()
```

Note that when we did our first manipulation of this dataset, we went from working with a `FileSystemDataset`, which is a representation of a dataset on disk without reading it into memory, to a `Table`, which is read into memory. `pyarrow` has a [number of functions](https://arrow.apache.org/docs/python/compute.html) that do computations on datasets without reading them into memory. However, these are evaluated "eagerly," as opposed to "lazily." They are useful in some cases, like above, where we want to take a larger-than-memory dataset and generate a smaller dataset (via filter, or group by/summarize), but they are not as useful if we need to do a join before our summarization/filter.

More functionality for lazy evaluation is on the horizon for `pyarrow`, though, by leveraging [Ibis](https://ibis-project.org/docs/3.0.2/tutorial/01-Introduction-to-Ibis/).
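
One way to keep as much of that work as possible out of memory is to push column selection and row filters into the dataset scan itself, so that only the matching data ever becomes a `Table`. A small sketch (the filter value and column names are illustrative):

```{python}
#| eval: false
import pyarrow.compute as pc

# still a FileSystemDataset: nothing has been read yet
print(type(deltafish))

# the filter and column projection are applied during the scan,
# so only matching rows are materialized as an in-memory Table
smelt = deltafish.to_table(columns=["SampleID", "Count"],
                           filter=pc.field("Taxa") == "Hypomesus transpacificus")
print(type(smelt))
```
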
## Parquet Performance hints

While the specifics of your data structures, organization, and access patterns will ultimately drive the best approach to representing tabular data, a few good practices have emerged that may help you. But even without these, just moving from text formats like CSV to the open binary format of parquet may net you some huge gains. A few things you might consider:

- Avoid Lots of Small Files

  A lot of small files generally means a lot of metadata handling, and a lot of read and write operations that can add up. Modern filesystems can handle directories with millions of file entries, but that doesn't mean it is convenient to use them that way. While at times there are good reasons to break a file into multiple partitioned files, accumulating too many small files can bog down your system. For larger datasets, individual files of up to a gigabyte will generally work well.

- Partition your Data

  Partitioning your data into multiple parquet files, each containing one cluster of the data, can really speed things up when people only need to access a few of those partitions. If your partitioning scheme matches your data access patterns, then you can significantly lower the amount of data people need to read to get at what they want. A good rule of thumb is to create partitions that correspond to your access filters, as long as there aren't too many values for that variable. For example, if people query by month, then a partition over month will allow access to just the subset of data that meets a particular filter query (one way to write such partitions with `pyarrow` is sketched after this list).

- Tune Row Groups

  Row groups contain pages of column data and are the main storage unit within a parquet file. By picking a good row group size, you can optimize your performance. Larger row groups result in fewer I/O operations to read each chunk of column data, which means reads will likely be faster; but those reads will be larger, and so will require more memory. Also, if you want to do fine-grained parallel processing, it helps if the size of the row groups and chunks corresponds to the size of your processing jobs -- you don't want workers reading more data than needed just because it arrives in a big row group. Row group size and your partitioning scheme work hand in hand to optimize the size of the data chunks that are accessed (the sketch below also caps the row group size at write time).
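
Here is a minimal sketch of both ideas with `pyarrow` (the table, partition column, and row group cap are all made up for illustration):

```{python}
#| eval: false
import pyarrow as pa
import pyarrow.dataset as ds

# a tiny stand-in table with a low-cardinality 'month' column to partition on
table = pa.table({
    "month": [1, 1, 2, 2],
    "site": ["A", "B", "A", "B"],
    "coverage": [0.1, 0.4, 0.2, 0.5],
})

ds.write_dataset(
    table,
    "coverage_by_month",
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("month", pa.int64())]), flavor="hive"),
    max_rows_per_group=1_000_000,  # cap row group size to keep reads parallel-friendly
)
```
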
## Synopsis

In this lesson we focused on:

- the difference between row major and column major formats
- under what circumstances a column major data format can improve memory efficiency
- how to use DuckDB and Arrow to interact with Parquet files and scalably analyze data
