My First Billion (of Rows) in DuckDB | by João Pedro | May 2024

DuckDB is an Open Source project [OSD]; the author has no affiliation with DuckDB/DuckDB Labs. The data used is available under the ODbL License. This project is completely free to carry out: it doesn't require payment for any services, data access, or other expenses.

The problem consists of processing records from Electronic Ballot Boxes' logs to obtain statistical metrics about the voting time of Brazilian voters: for example, the average time citizens take to vote, to have their fingerprints collected for identification, and so on. These metrics need to be aggregated at several granularity levels: country, state, electoral zone, and electoral section.

In case you don't know, Brazil has a 100% electronic voting system, where all of the 100+ million citizens vote on a single day and the election's results are computed and released in near real time. Votes are collected by hundreds of thousands of electronic ballot boxes spread all over the country.

Electronic ballot box. Image from the Brazilian Superior Electoral Court.

An electronic ballot box is a microcomputer for the specific use of elections, with the following characteristics: resistant, small, light, with energy autonomy, and with security features [4]. Each one can serve up to 500 voters, a number chosen to avoid big queues at the voting places.

The system is run by the TSE (Superior Electoral Court), which shares data about the process on its open data portal [ODbL License]. The logs are text files with an exhaustive list of all the events that happen in a ballot box.

And that's where the challenge begins. Since the logs register absolutely every single event, it's possible to calculate a vast number of metrics from them; they are a rich fountain of information. But what makes them rich also makes them extremely hard to handle, as the totality of the country's records reaches the milestone of 450 GB of TSV files with 4+ billion lines.

Besides the volume, another thing that makes this work a good benchmark, in my opinion, is that the transformations needed to reach our final goal cover a wide range of complexities, from simple operations (where, group by, order by) to complex SQL operations (like window functions).

With this relatively high volume of data, one might be tempted to reach for traditional Big Data tools, like Apache Spark, and process it in a cluster with many workers, several gigabytes of RAM, and a dozen CPUs.

DuckDB was created to challenge this status quo.

As its creator argues (in this video), it's a database designed to empower single machines with the ability to process large volumes of data.

That is, instead of looking for complex industry solutions, like PySpark, or cloud-based solutions, like Google BigQuery, one can use a local in-process database with standard SQL to perform the needed transformations.

So, in a nutshell, DuckDB is an in-process database (it runs inside the program itself, with no independent server process, much like SQLite), OLAP-oriented (tuned for analytical workloads), that handles data in traditional formats (CSV, Parquet) and is optimized to process large volumes of data using the power of a single machine (which doesn't need to be very powerful).
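To make that concrete, here is a minimal usage sketch (the file name and column names are placeholders for illustration, not from the project):

import duckdb

# Query a Parquet file directly with plain SQL: no server, no table creation.
# 'logs_sample.parquet' and 'event_type' are placeholder names.
result = duckdb.sql("""
    SELECT event_type, COUNT(*) AS n_events
    FROM 'logs_sample.parquet'
    GROUP BY event_type
    ORDER BY n_events DESC
""")
print(result.df())  # materialize the result as a pandas DataFrame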

A ballot box's log is a single TSV file with a standardized name (XXXXXYYYYZZZZ.csv) composed of the box's location metadata: the first 5 digits are the city code, the next 4 the electoral zone (a geographical subdivision of a state), and the last 4 the electoral section (the ballot box itself).
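To make the naming convention concrete, here is a tiny hypothetical helper (not part of the project's code) that splits such a name into its parts:

def parse_ballot_box_filename(filename: str) -> dict:
    # Hypothetical helper: 5 digits of city code, 4 of electoral zone, 4 of section.
    code = filename.removesuffix(".csv")
    return {
        "city_code": code[0:5],
        "zone_code": code[5:9],
        "section_code": code[9:13],
    }

# A made-up name following the XXXXXYYYYZZZZ.csv pattern
print(parse_ballot_box_filename("0123404560789.csv"))
# {'city_code': '01234', 'zone_code': '0456', 'section_code': '0789'}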

There are almost 500,000 ballot boxes in Brazil and, therefore, almost 500,000 files. A file's size depends on the number of voters in the section, which varies from 1 to 500. This is what the logs look like:

2022-10-02 09:35:17 INFO 67305985 VOTA Voter was enabled
2022-10-02 09:43:55 INFO 67305985 VOTA Vote confirmed for [Federal Deputy]
2022-10-02 09:48:39 INFO 67305985 VOTA Vote confirmed for [State Deputy]
2022-10-02 09:49:10 INFO 67305985 VOTA Vote confirmed for [Senator]
2022-10-02 09:49:47 INFO 67305985 VOTA Vote confirmed for [Governor]
2022-10-02 09:50:08 INFO 67305985 VOTA Vote confirmed for [President]
2022-10-02 09:50:09 INFO 67305985 VOTA The voter's vote was computed
# Literal translations to English
# Events that represent a vote

We're interested in transforming this raw information into statistical metrics about voting time (How much time does each voter take to vote? How many votes are computed each minute?) at several granularity levels (country, state, city) and, to achieve that, we're going to create an OLAP Cube like this:

| State          | City           | Mean Voting Time (seconds) | Max Votes Computed in 5 Min |
|----------------|----------------|----------------------------|-----------------------------|
| Null           | Null           | 50                         | 260                         |
| São Paulo      | São Paulo      | 30                         | 300                         |
| São Paulo      | Campinas       | 35                         | 260                         |
| São Paulo      | Null           | 20                         | 260                         |
| Rio de Janeiro | Rio de Janeiro | 25                         | 360                         |
| Minas Gerais   | Belo Horizonte | 40                         | 180                         |
| Bahia          | Salvador       | 28                         | 320                         |
| Rio Grande ... | Porto Alegre   | 30                         | 300                         |
| ...            | ...            | ...                        | ...                         |

Setting up the environment

All that's needed to run this project is a Python environment with the DuckDB package installed.

pip install duckdb

Transforming the data

In the following sections, I'll describe each transformation, its objectives, how DuckDB performs it, and its advantages, challenges, results, and conclusions.

The processing is divided into 4 steps: convert TSV files to Parquet; filter and clean; isolate votes and their attributes; and compute the metrics for the OLAP Cube.

Processing steps. Image by Author.

Unfortunately, to avoid making this post enormous, I won't explain every transformation in detail, but all the code is available in the GitHub repository.

Converting TSV files to Parquet

A simple and indispensable step for anyone who wants to work with large volumes of data. Doing this in DuckDB is easy.

First, create a DuckDB session:

cursor = duckdb.connect("")

In this example, we instantiate the database connector with an empty string. This indicates that DuckDB should not create its own database file; rather, it should only interact with the files on disk. As mentioned earlier, DuckDB is a database, so it has the functionality to create tables, views, and so on, which we won't explore here. We'll focus solely on using it as a transformation engine.
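For comparison, a quick sketch of the two connection modes (the database file name below is arbitrary):

import duckdb

# In-memory session: nothing is persisted; DuckDB acts purely as a query engine
# over external files (the mode used throughout this project).
cursor = duckdb.connect("")

# File-backed session: DuckDB creates/opens its own database file, so tables and
# views created in it survive between runs ('analytics.duckdb' is an arbitrary name).
persistent_cursor = duckdb.connect("analytics.duckdb")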

And define the following query:

query = f"""
    COPY (
        SELECT
            *
        FROM read_csv('/data/logs/2_{state}/*.csv', filename=True)
    ) TO '{state}.parquet' (FORMAT 'parquet');
"""
cursor.execute(query)

And that’s all!

Let's detail the query:

The inner expression is just a standard SELECT * FROM table query; the only difference is that, instead of referencing a table, DuckDB can reference files directly.

The result of this query could be imported into a pandas DataFrame for further processing, just like this:

my_df = cursor.execute(query).df()

Which allows seamless integration between DuckDB and pandas.
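The integration also works in the opposite direction: DuckDB can query a pandas DataFrame available in the Python session by its variable name. A small sketch with made-up data:

import duckdb
import pandas as pd

# A tiny DataFrame with made-up values, just for illustration
votes_df = pd.DataFrame({
    "state": ["SP", "SP", "RJ"],
    "vote_time": [30, 42, 25],
})

# DuckDB resolves `votes_df` from the local Python scope
avg_by_state = duckdb.sql(
    "SELECT state, AVG(vote_time) AS avg_vote_time FROM votes_df GROUP BY state"
).df()
print(avg_by_state)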

The outer expression is a simple COPY … TO …, which writes the inner query's result to a file.

In this first transformation, we can already see one of DuckDB's strengths: the ability to interact with files using plain old SQL, without needing to configure anything else. The query above is no different from the day-to-day operations we perform in standard DBMSs, like PostgreSQL and MySQL; the only difference is that, instead of manipulating tables, we're interacting with files.

Initially, we had 450 GB of TSV files; after ~30 min, we ended up with 97 GB of Parquet.

Filter and Clean

As mentioned earlier, the logs store every event that happens on a ballot box. This first step aims to keep only the vote-related events, like 'The voter voted for PRESIDENT', 'The voter had fingerprints collected', and 'The vote was computed', that occurred on the election days (this is important, since the logs also store training sessions and other administrative procedures carried out on the machines).

A simple query, but with a lot of text and date manipulation:


VOTES_DESCRIPTIONS = [
    # VOTES
    "event_description = 'Aguardando digitação do título'",
    # Awaiting voter's título (Voter Registration ID) input
    "event_description = 'Título digitado pelo mesário'",
    # Voter's título entered by the poll worker
    "event_description = 'Eleitor foi habilitado'",
    # Voter has been enabled
    "event_description ILIKE 'Voto confirmado par%'",
    # Vote confirmed for ... can be [PRESIDENT, SENATOR, DEPUTY, ...]
    "event_description = 'O voto do eleitor foi computado'",
    # Voter's vote has been computed
]

ACCEPTED_DATES = [
    '2022-10-02', '2022-10-30',  # Constitutional dates of the election
    '2022-10-03', '2022-10-31',
]

query = f"""
    SELECT
        *
    FROM (
        SELECT
            event_timestamp,
            event_timestamp::date AS event_date,
            event_type,
            some_id,
            event_system,
            event_description,
            event_id,

            REPLACE(SPLIT_PART(filename, '/', 5), '_new.csv', '') AS filename,

            -- Metadata extracted from the filename
            SUBSTRING( SPLIT_PART(SPLIT_PART(filename, '/', 5), '-', 2), 1, 5 ) AS city_code,
            SUBSTRING( SPLIT_PART(SPLIT_PART(filename, '/', 5), '-', 2), 6, 4 ) AS zone_code,
            SUBSTRING( SPLIT_PART(SPLIT_PART(filename, '/', 5), '-', 2), 10, 4 ) AS section_code,
            REPLACE(SPLIT_PART(filename, '/', 4), '2_', '') AS uf
        FROM
            {DATASET}
        WHERE 1=1
            AND ( {' OR '.join(VOTES_DESCRIPTIONS)} )
    ) _
    WHERE 1=1
        AND event_date IN ({', '.join([F"'{date}'" for date in ACCEPTED_DATES])})
"""

This query highlights another advantage of DuckDB: the ability to read and write partitioned data. Table partitioning is very relevant in the context of Big Data, but it's even more important in the single-machine paradigm, given that we're using the same disk for input and output, i.e., it suffers twice, so every optimization is welcome.
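As a sketch of what partitioned I/O looks like in DuckDB (paths and column names below are illustrative, not the project's actual ones): COPY can write Hive-style partitions, and read_parquet can skip partitions when the partition column is filtered.

import duckdb

cursor = duckdb.connect("")

# Write Parquet files partitioned by state (illustrative path and column names)
cursor.execute("""
    COPY (SELECT * FROM read_parquet('filtered/*.parquet'))
    TO 'filtered_by_uf' (FORMAT 'parquet', PARTITION_BY (uf));
""")

# Reading back with a filter on the partition column lets DuckDB
# skip every directory except uf=SP
sp_count = cursor.execute("""
    SELECT COUNT(*)
    FROM read_parquet('filtered_by_uf/**/*.parquet', hive_partitioning = true)
    WHERE uf = 'SP';
""").fetchone()[0]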

Initially, we had 97 GB; after ~30 min, we were left with 63 GB of Parquet.

Isolate votes and their attributes

As each vote is composed of several lines, we need to condense all its information into a single record to ease the calculations. Here things get tricky: the query becomes complex and, unfortunately, DuckDB couldn't process all the data in one go.

To overcome this issue, I wrote a loop to process the data incrementally in slices:

for state in states:
    for date in ACCEPTED_DATES:
        for zone_group in ZONE_GROUPS:
            query = f"""
                COPY
                {
                    complex_query_goes_here
                    .replace('<uf>', state)
                    .replace('<event_date>', date)
                    .replace('<zone_id_min>', str(zone_group[0]))
                    .replace('<zone_id_max>', str(zone_group[1]))
                }
                TO 'VOTES.parquet'
                (FORMAT 'parquet', PARTITION_BY (event_date, uf, zone_group), OVERWRITE_OR_IGNORE 1);
            """
            cursor.execute(query)

The implementation details don't matter much; the interesting part is that we didn't need to change the code too much to build this final table incrementally. As each processed 'slice' represents a partition, by setting the parameter OVERWRITE_OR_IGNORE to 1, DuckDB will automatically overwrite any existing data for that partition or ignore it if it already exists.
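The condensation query itself lives in the repository. As a rough sketch of the idea (with hypothetical column names, and assuming 'enabled' and 'computed' events strictly alternate within a section, which the real query handles more carefully), a window function can pair each voter's 'enabled' event with the following 'computed' event and take the difference as the voting time:

condense_sketch = """
    SELECT
        uf, zone_code, section_code,
        event_timestamp AS enabled_at,
        next_event_at   AS computed_at,
        DATE_DIFF('second', event_timestamp, next_event_at) AS vote_time
    FROM (
        SELECT
            uf, zone_code, section_code,
            event_timestamp,
            event_description,
            -- window function: look ahead to the next relevant event in the same section
            LEAD(event_timestamp) OVER (
                PARTITION BY uf, zone_code, section_code
                ORDER BY event_timestamp
            ) AS next_event_at
        FROM read_parquet('filtered_by_uf/**/*.parquet', hive_partitioning = true)
        WHERE event_description IN ('Eleitor foi habilitado', 'O voto do eleitor foi computado')
    ) _
    WHERE event_description = 'Eleitor foi habilitado'
"""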

Initially, we had 63 GB; after ~1 hour and 20 minutes, we ended up with 15 GB of Parquet.

Compute metrics and build the OLAP Cube

This is a simple step. Now, with each vote represented by a single record, all that's needed is to compute the metrics.

query_metrics = f"""
    SELECT
        turno, state,
        zone_code,
        section_code,

        COUNT(*) AS total_votes,
        COUNT( DISTINCT state || zone_code || section_code ) AS total_sections,

        SUM( vote_time ) AS voting_time_sum,
        AVG( vote_time ) AS average_voting_time,

        MAX( nr_of_votes ) AS total_ballot_items_voted,
        SUM( nr_of_keys_pressed ) AS total_keys_pressed
    FROM
        source
    GROUP BY ROLLUP(turno, state, zone_code, section_code)
"""

As we need to compute the metrics at many levels of granularity, the easiest way to do this is with a GROUP BY + ROLLUP.
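For context, ROLLUP emits one group per prefix of the listed columns, and the aggregated levels show up as NULLs, which is exactly how the cube table earlier encodes the country-, state-, and city-level rows. A tiny self-contained example with made-up values:

import duckdb

duckdb.sql("""
    SELECT turno, state, zone_code, AVG(vote_time) AS average_voting_time
    FROM (VALUES (1, 'SP', '0001', 30),
                 (1, 'SP', '0002', 40),
                 (1, 'RJ', '0001', 25)) AS votes(turno, state, zone_code, vote_time)
    GROUP BY ROLLUP (turno, state, zone_code)
    ORDER BY turno NULLS FIRST, state NULLS FIRST, zone_code NULLS FIRST
""").show()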

In this step, DuckDB stood out significantly: we started with 15 GB and, after 36 seconds, the file was reduced to 88 MB!

This is blazing-fast performance: it grouped more than 200 million rows at 4 different levels of granularity, where the highest level has cardinality 2 and the lowest cardinality ~200,000, in less than a minute!

The table below summarizes the results:

The total pipeline execution time was ~2h30min, executed on WSL with the following specs: ~16 GB of DDR4 RAM, a 12th-generation Intel Core i7 processor, and a 1 TB NVMe SSD.

During the process, I noticed that memory usage was a bottleneck, as DuckDB constantly created temporary files on disk in a .temp/ directory. I also had a lot of trouble running queries with window functions: they not only took longer than expected to execute, but the program also crashed randomly several times.
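I didn't tune any of these, but for anyone hitting the same walls, DuckDB exposes settings for capping memory, choosing the spill directory, and limiting threads (the values below are arbitrary examples):

import duckdb

cursor = duckdb.connect("")

# Cap memory so DuckDB spills to disk earlier instead of exhausting RAM
cursor.execute("SET memory_limit = '12GB';")
# Put temporary spill files on a disk with plenty of free space (arbitrary path)
cursor.execute("SET temp_directory = '/mnt/big_disk/duckdb_tmp';")
# Reduce parallelism if memory pressure is the bottleneck
cursor.execute("SET threads = 4;")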

Despite that, I believe the performance achieved was satisfactory; after all, we're talking about half a terabyte of data being processed with complex queries by a single machine (and not a particularly strong one, compared with clusters of computers).

The fact is that processing data is, sometimes, like refining uranium: we start with an enormous mass of raw material and, through a hard, time-consuming, and costly process (that, sometimes, puts lives at risk), we extract a small portion of relevant, refined information.

Jokes aside, in my posts I've explored many ways of doing data processing, talking about tools, techniques, data architectures… always looking for the best way of doing things. This kind of knowledge is important, as it helps us choose the right tool for the right job. The goal of this post was precisely to understand what kind of job DuckDB solves and what experience it provides.

And, in general terms, it was a good experience.

Working with this database was very smooth: I barely had to configure anything, I just imported it and manipulated the data with plain-old SQL statements. In other words, the tool has an almost zero initial entry barrier for those who already know SQL and a little bit of Python. In my opinion, this was DuckDB's big victory: it not only empowered my machine with the ability to process 450 GB of data, but it did so with a low adaptation cost for the environment (and the programmer).

In terms of processing speed, considering the complexity of the project, the 450 GB volume, and the fact that I didn't tune the database parameters, 2h30m was a good result. Especially considering that, without this tool, it would be impossible, or extremely complex, to carry out this task on my computer.

DuckDB sits somewhere between Pandas and Spark. For small volumes of data, Pandas can be more attractive in terms of usability, especially for people with some programming background, as the package has many built-in transformations that could be tricky to implement in SQL. It also has seamless integration with many other Python packages, including DuckDB. For very large volumes of data, Spark will probably be a better alternative, with its parallelism, clusters, and all that stuff. So, DuckDB fills a blind spot for medium-to-not-so-large projects, where using pandas would be impossible and Spark, overkill.

DuckDB extends the limits that a single machine can reach and expands the range of projects that can be developed locally, bringing speed to the analysis and manipulation of large volumes of data. Without a doubt, it's a powerful tool that I'll proudly add to my toolbox.

I hope this post helped you get a better view of DuckDB. As always, I'm not an expert in any of the subjects addressed in this post, and I strongly recommend further reading; my references are listed below and the code is available on GitHub.

Thanks for reading! 😉

All the code is available in this GitHub repository.
Interested in more works like this one? Visit my posts repository.

[1] 2022 Results - Data transmitted for totalization - TSE Open Data Portal. Link. [ODbL]
[2] Databricks. (2023, June 29). Data + AI Summit Keynote, Thursday Part 5 - DuckDB. YouTube.
[3] DuckDB Official Documentation. DuckDB.
[4] The electronic ballot box. Superior Electoral Court.
[5] Wikipedia contributors. (2023, July 25). OLAP cube. Wikipedia.
[6] DuckDB - GitHub. Window performance · Issue #7809 · duckdb/duckdb.
[7] Gunnar Morling. GitHub - gunnarmorling/1brc: 1️⃣🐝🏎️ The One Billion Row Challenge - A fun exploration of how quickly 1B rows from a text file can be aggregated with Java.
