DuckDBClient by mbostock · Pull Request #313 · observablehq/stdlib

mbostock · 2022-11-02T22:31:21Z

This…

Mostly rewrites CMU’s DuckDBClient (with a proper implementation of queryStream).
Minimizes the exposed interface for DuckDBClient to reduce maintenance burden and discourage mutation.
Preserves stdlib’s Arrow as apache-arrow@4 for backwards compatibility.
Adds apache-arrow@9 (not 8) as a new dependency for DuckDB, per its package.json.
Uses jsDelivr’s esm.run to load @duckdb/duckdb-wasm and apache-arrow as an ES module (rather than require).
Minimizes the changes to recommended library versions in src/dependencies.mjs.
Adds CMU DIG’s BSD license to the derived code for DuckDBClient.

Demo:

TODO Document and test the DuckDBClient implementation.

mbostock · 2022-11-03T02:57:33Z

  "parserOptions": {
    "sourceType": "module",
-    "ecmaVersion": 2018
+    "ecmaVersion": 2020


This is needed for dynamic import. However, it probably means this should be released as a major version bump, since older JavaScript environments may not support the new syntax. We could avoid this by reverting to the AMD (require) bundle for duckdb-wasm and apache-arrow, but it might be time to move forward. I still need to think about this a bit.
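For readers unfamiliar with the syntax in question, here is a minimal sketch of a dynamic import. A data: URL stands in for the CDN-hosted module so the snippet runs offline; the `loadModule` helper and the `label` export are hypothetical names used only for illustration.

```javascript
// A data: URL stands in for the CDN-hosted module so this sketch runs offline.
// The export name below is hypothetical, not something from the PR.
const moduleUrl =
  "data:text/javascript," + encodeURIComponent('export const label = "stand-in";');

// import(...) is an expression that returns a promise for the module namespace.
// It became standard in ES2020, which is why a parser set to 2018 rejects it.
async function loadModule(url) {
  return await import(url);
}

loadModule(moduleUrl).then((module) => console.log(module.label));
```

Unlike a static `import` declaration, the call can appear inside a function and its specifier can be computed at runtime, which is what makes lazy loading of a large dependency possible.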

mbostock · 2022-11-03T03:03:49Z

/cc @domoritz if you want to review!

domoritz · 2022-11-03T11:19:50Z

Oh cool. I'll take a look.

kimmolinna · 2022-11-03T17:22:08Z

@mbostock this sounds really good. 👍

mbostock · 2022-11-07T16:04:32Z

+      async *readRows() {
+        try {
+          while (!batch.done) {
+            yield batch.value.toArray();


Not as part of this PR, but we should have an option to return Apache Arrow Table instances directly from the database client rather than converting to an array of objects (rows) here. Probably this is an option (arrow = true) when you run a query, and perhaps you can specify it as a default option when constructing the DuckDBClient. In conjunction with #315 it would mean that we could entirely avoid materializing the array of objects when displaying the results of a DuckDBClient query in a table.

At the moment DuckDBClient.query returns an Apache Arrow Table, so this sounds good once again.

And it would be really great if you could also use an Apache Arrow Table with Plot.

@kimmolinna Agreed! Tracking that here: observablehq/plot#1103

Sweet. I think native arrow support throughout would be awesome since we reduce materializations.

mkfreeman · 2022-11-07T18:33:53Z

-      this._db = _db;
-      this._counter = 0;
+  async queryStream(query, params) {
+    const connection = await this._db.connect();


What type of object is this._db that has the .connect method?

A database client? I'm not seeing a .connect method in the specification.

No, it is an AsyncDuckDB instance created here:

stdlib/src/duckdb.mjs

Line 120 in a0fef0c

const db = await createDuckDB();

The connect method is documented here:

https://shell.duckdb.org/docs/classes/index.AsyncDuckDB.html#connect

mkfreeman · 2022-11-07T18:39:26Z

-      } else {
-        result = await conn.query(query);
+  async query(query, params) {
+    const result = await this.queryStream(query, params);


Do we need to have an implementation of .query that doesn't rely on queryStream for some back-compatibility with existing DuckDB notebooks (e.g., this cell)? I imagine it's fine, as those imported versions of DuckDBClient will overwrite the standard library version, but just wanted to surface this as a possible concern.

(leaving some notes as we're pairing on the review, thinking out loud a bit) - does this mean that all DuckDBClients support streaming (as this uses the queryStream method)?

Anything in the standard library can be overridden, so any notebook that defines its own DuckDBClient won’t see the implementation here.

As far as this implementation goes, the fact that it implements db.query on top of db.queryStream should be totally invisible to the user. It implements the db.query method as specified in the database client specification and it only calls db.queryStream under the hood to reduce code duplication. It does not somehow mean that db.query or db.queryRow supports streaming; those methods still just return a promise to the results. But yes, this DuckDBClient does implement db.queryStream and hence it supports streaming.
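To make the layering concrete, here is a minimal sketch of a query method built on top of queryStream. `SketchClient` is a hypothetical stand-in that serves batches from an in-memory array instead of a DuckDB connection; it is not the code in this PR.

```javascript
// Hypothetical stand-in for a streaming client: batches come from an in-memory array.
class SketchClient {
  constructor(batches) {
    this._batches = batches;
  }

  // Returns a "query stream response": a schema plus an async iterator of row batches.
  async queryStream(query, params) {
    const batches = this._batches;
    return {
      schema: [], // left empty in this sketch
      async *readRows() {
        for (const batch of batches) yield batch;
      }
    };
  }

  // Built on queryStream, but the caller only ever sees a promise to all rows.
  async query(query, params) {
    const result = await this.queryStream(query, params);
    const rows = [];
    for await (const batch of result.readRows()) {
      for (const row of batch) rows.push(row);
    }
    rows.schema = result.schema;
    return rows;
  }
}

const client = new SketchClient([[{id: 1}, {id: 2}], [{id: 3}]]);
client.query("SELECT * FROM t").then((rows) => console.log(rows.length)); // logs 3
```

The caller of `query` cannot tell that batches were involved: it awaits one promise and gets one array.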

I should also note that I didn’t implement the signal option for query methods in the DuckDBClient so there isn’t currently a way to cancel queries. But in practice that probably shouldn’t matter much because it’s an in-memory client and therefore most queries should be relatively fast. We could add it in the future.
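If cancellation were added later, one possible shape is to check an AbortSignal between batches. The sketch below assumes the signal is a standard AbortSignal; the `readRows` and `collect` helpers are hypothetical and operate on in-memory batches rather than a real connection.

```javascript
// Hypothetical helper: yields batches until the signal reports that it was aborted.
async function* readRows(batches, signal) {
  for (const batch of batches) {
    if (signal && signal.aborted) throw new Error("query aborted");
    yield batch;
  }
}

// Collects every batch into one array, rejecting if the signal aborts midway.
async function collect(batches, signal) {
  const rows = [];
  for await (const batch of readRows(batches, signal)) rows.push(...batch);
  return rows;
}

const controller = new AbortController();
controller.abort();
collect([[1], [2]], controller.signal).catch((error) => console.log(error.message)); // logs "query aborted"
```

Checking between batches only stops further reads; it does not interrupt work the database has already started.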

mkfreeman · 2022-11-07T18:43:43Z

-      return Inputs.table(result);
-    }
+  async queryRow(query, params) {
+    const results = await this.query(query, params);


Since this.query calls this.queryStream as the first line, should we just call queryStream directly?

We could do that, but then we’d have to duplicate the rest of db.query, too (e.g., to call result.readRows). But I guess that’s worth it since we don’t need to read multiple batches just to get the first row.

Optimized in 6bf7aea.
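A rough sketch of the idea: pull only the first batch, then end the reader early so its cleanup still runs. `makeReader` is a hypothetical stand-in for result.readRows(), and the `state` object exists only to make the behavior observable.

```javascript
// Hypothetical stand-in for result.readRows(): two batches, with cleanup in finally.
function makeReader(state) {
  return (async function* () {
    try {
      state.pulls = 1;
      yield [{id: 1}, {id: 2}];
      state.pulls = 2; // only reached if the consumer asks for a second batch
      yield [{id: 3}];
    } finally {
      state.closed = true; // stands in for closing the connection
    }
  })();
}

// Reads one batch, returns its first row (or null), and always ends the reader.
async function firstRow(reader) {
  try {
    const {done, value} = await reader.next();
    return done || !value.length ? null : value[0];
  } finally {
    await reader.return(); // runs the generator's finally block
  }
}

const state = {};
firstRow(makeReader(state)).then((row) => console.log(row.id, state.pulls, state.closed)); // logs 1 1 true
```

Calling `return()` on the suspended generator is what lets the cleanup run without reading the remaining batches.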

mkfreeman · 2022-11-07T18:46:58Z

-      });
-      await conn.close();
+  async sql(strings, ...args) {
+    return await this.query(strings.join("?"), args);


Same comment as above (should we just use queryStream?)

No, then we’d need to duplicate the rest of db.query. db.queryStream doesn’t return a promise to an array of results; it returns a query stream response.

Got it, thanks for the clarification!
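As an aside on the strings.join("?") line in the hunk above, this sketch shows how a tagged template turns interpolated values into positional parameters. The `sql` function here returns the query and parameters instead of running them, and the table and column names are made up.

```javascript
// A tag function receives the literal string pieces and the interpolated values separately.
function sql(strings, ...args) {
  // Joining the pieces with "?" leaves one placeholder per interpolated value.
  return {query: strings.join("?"), params: args};
}

const minYear = 2020;
const title = "x";
const {query, params} = sql`SELECT * FROM t WHERE year > ${minYear} AND title = ${title}`;

console.log(query); // logs "SELECT * FROM t WHERE year > ? AND title = ?"
console.log(params); // logs [ 2020, 'x' ]
```

Because the values travel as parameters and are never spliced into the SQL text, quoting is left to the database.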

libbey-observable

Did some basic testing locally after pairing on the review with @mkfreeman. Looks good!

domoritz

This is great. Just added some comments for more flexible imports.

domoritz · 2022-11-07T21:06:59Z

+  }
+  {
+    const package = await resolve("apache-arrow@9");
+    console.log(`export const arrow9 = dependency("${package.name}", "${package.version}", "+esm");`);


Arrow 10 is out so you can update to that.

Going to stick with Arrow 9 until duckdb-wasm upgrades…

https://github.com/duckdb/duckdb-wasm/blob/d779c7d5b6758a2abade699d1b564e1140a68de1/packages/duckdb-wasm/package.json#L26

domoritz · 2022-11-07T21:09:28Z

+      async *readRows() {
+        try {
+          while (!batch.done) {
+            yield batch.value.toArray();


Sweet. I think native arrow support throughout would be awesome since we reduce materializations.

domoritz · 2022-11-07T21:14:13Z

+  const table = arrow.tableFromJSON(array);
+  const buffer = arrow.tableToIPC(table);
+  const connection = await database.connect();
+  try {


I think it would be good to pull out a method for inserting an arrow Table. This method would be used to insert data that is already in columnar form.

Added in 77ccac0.

domoritz · 2022-11-07T21:14:34Z

+          await connection.query(
+            `CREATE VIEW '${name}' AS SELECT * FROM parquet_scan('${file.name}')`
+          );
+        } else {


Add a case for file.name.endsWith(".arrow")

Added in 77ccac0.
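The extension check could be sketched as a small classifier like the one below. The `fileKind` name and its return values are hypothetical; the sketch only decides which branch applies and does not show how each kind of file is loaded into DuckDB.

```javascript
// Hypothetical helper: picks a loading strategy from the file name's extension.
function fileKind(name) {
  if (name.endsWith(".parquet")) return "parquet";
  if (name.endsWith(".arrow")) return "arrow";
  return "other"; // anything else is left to other checks, which this sketch omits
}

console.log(fileKind("flights.parquet")); // logs "parquet"
console.log(fileKind("flights.arrow")); // logs "arrow"
console.log(fileKind("flights.csv")); // logs "other"
```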

domoritz · 2022-11-07T21:18:26Z

+    return {
+      schema: schema.fields.map(({name, type}) => ({
        name,
        type: getType(String(type)),


This is maybe more brittle than directly checking against Arrow types but probably okay for now. Can improve later.

Yes, I can use the Apache Arrow type identifiers when looking at the schema for an Apache Arrow Table, but the result of a DESCRIBE doesn’t give me the numeric identifiers — I only get the string column_type. I want to use the same code for describeColumns, too. It’d be nice if DuckDB returned the numeric type identifier in addition to the string.

Added in 990c240. And I was able to simplify the switch for DuckDB types since now we only need to handle the canonical types returned by DESCRIBE rather than the aliases. (At least, I hope that’s the case… it appears to be in my local testing.)
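For illustration, a string-based mapping from DESCRIBE column types might look like the sketch below. The function name `sketchType`, the subset of DuckDB types covered, and the result names are assumptions chosen for the example; this is not the switch from the commit.

```javascript
// Hypothetical mapping from a DuckDB column_type string to a coarse client-side type.
function sketchType(columnType) {
  switch (columnType) {
    case "BIGINT":
      return "bigint";
    case "DOUBLE":
      return "number";
    case "INTEGER":
      return "integer";
    case "BOOLEAN":
      return "boolean";
    case "DATE":
    case "TIMESTAMP":
      return "date";
    case "VARCHAR":
      return "string";
    default:
      return "other"; // any type this sketch does not list
  }
}

console.log(sketchType("VARCHAR")); // logs "string"
console.log(sketchType("BLOB")); // logs "other"
```

Matching on strings works only as long as DESCRIBE reports canonical names, which is the brittleness raised in the review comment.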

mbostock requested a review from mkfreeman November 2, 2022 22:31

DuckDBClient

198cfbb

mbostock force-pushed the mbostock/duckdb branch from 7cb30c9 to 198cfbb Compare November 3, 2022 02:54

default query.castTimestampToDate

a0fef0c

mbostock commented Nov 3, 2022

domoritz mentioned this pull request Nov 3, 2022

Add Observable client to duckdb-wasm duckdb/duckdb-wasm#1056

Closed

mbostock mentioned this pull request Nov 5, 2022

Apache Arrow for table cells #315

Closed

1 task

mbostock commented Nov 7, 2022

mkfreeman requested a review from libbey-observable November 7, 2022 17:36

mkfreeman reviewed Nov 7, 2022

optimize queryRow

6bf7aea

libbey-observable approved these changes Nov 7, 2022

domoritz approved these changes Nov 7, 2022

use arrow types for schema

990c240

mbostock force-pushed the mbostock/duckdb branch from 71fbda3 to a849d1b Compare November 8, 2022 04:08

Apache Arrow for DuckDBClient

77ccac0

mbostock force-pushed the mbostock/duckdb branch from a849d1b to 77ccac0 Compare November 8, 2022 04:10

mbostock merged commit 77ccac0 into mkfreeman/duckdb Nov 8, 2022

mbostock deleted the mbostock/duckdb branch November 8, 2022 04:21

domoritz mentioned this pull request Nov 11, 2022

Create new tables/views in a database client #319

Closed
