DuckDBClient#313
Conversation
7cb30c9 to
198cfbb
Compare
| "parserOptions": { | ||
| "sourceType": "module", | ||
| "ecmaVersion": 2018 | ||
| "ecmaVersion": 2020 |
There was a problem hiding this comment.
This is needed for dynamic import. However, this probably means that this should be released as a major version bump since it means older JavaScript environments may not be able to support the new syntax. We could avoid this by reverting to the AMD (require) bundle for duckdb-wasm and apache-arrow, but it might be time to move forward. I still need to think about this a bit.
|
/cc @domoritz if you want to review! |
|
Oh cool. I'll take a look. |
|
@mbostock this sounds really good. 👍 |
| async *readRows() { | ||
| try { | ||
| while (!batch.done) { | ||
| yield batch.value.toArray(); |
There was a problem hiding this comment.
Not as part of this PR, but we should have an option to return Apache Arrow Table instances directly from the database client rather than converting to an array of objects (rows) here. Probably this is an option (arrow = true) when you run a query, and perhaps you can specify it as a default option when constructing the DuckDBClient. In conjunction with #315 it would mean that we could entirely avoid materializing the array of objects when displaying the results of a DuckDBClient query in a table.
There was a problem hiding this comment.
At the moment DuckDBClient.query returns Apache Arrow Table so this sounds good once again.
There was a problem hiding this comment.
And It would be really great if you could use Apache Arrow Table also with Plot.
There was a problem hiding this comment.
@kimmolinna Agreed! Tracking that here: observablehq/plot#1103
There was a problem hiding this comment.
Sweet. I think native arrow support throughout would be awesome since we reduce materializations.
| this._db = _db; | ||
| this._counter = 0; | ||
| async queryStream(query, params) { | ||
| const connection = await this._db.connect(); |
There was a problem hiding this comment.
What type of object is this._db that has the .connect method?
There was a problem hiding this comment.
A database client? Not seeing a .connect method in the specification
There was a problem hiding this comment.
No, it is an AsyncDuckDB instance created here:
Line 120 in a0fef0c
The connect method is documented here:
https://shell.duckdb.org/docs/classes/index.AsyncDuckDB.html#connect
| } else { | ||
| result = await conn.query(query); | ||
| async query(query, params) { | ||
| const result = await this.queryStream(query, params); |
There was a problem hiding this comment.
Do we need to have an implementation of .query that doesn't rely on queryStream for some back-compatibility with existing DuckDB notebooks (e.g., this cell)? I imagine it's fine, as those imported versions of DuckDBClient will overwrite the standard library version, but just wanted to surface this as a possible concern.
There was a problem hiding this comment.
(leaving some notes as we're pairing on the review, thinking out loud a bit) - does this mean that all DuckDBClients support streaming (as this uses the queryStream method)?
There was a problem hiding this comment.
Anything in the standard library can be overridden, so any notebook that defines its own DuckDBClient won’t see the implementation here.
As far as this implementation goes, the fact that it implements db.query on top of db.queryStream should be totally invisible to the user. It implements the db.query method as specified in the database client specification and it only calls db.queryStream under the hood to reduce code duplication. It does not somehow mean that db.query or db.queryRow supports streaming; those methods still just return a promise to the results. But yes, this DuckDBClient does implement db.queryStream and hence it supports streaming.
I should also note that I didn’t implement the signal option for query methods in the DuckDBClient so there isn’t currently a way to cancel queries. But in practice that probably shouldn’t matter much because it’s an in-memory client and therefore most queries should be relatively fast. We could add it in the future.
| return Inputs.table(result); | ||
| } | ||
| async queryRow(query, params) { | ||
| const results = await this.query(query, params); |
There was a problem hiding this comment.
Since this.query calls this.queryStream as the first line, should we just call queryStream directly?
There was a problem hiding this comment.
We could do that, but then we’d have to duplicate the rest of db.query, too (to call result.readRows e.g.). But I guess that’s worth it since we don’t need to read multiple batches just to get the first row.
| }); | ||
| await conn.close(); | ||
| async sql(strings, ...args) { | ||
| return await this.query(strings.join("?"), args); |
There was a problem hiding this comment.
Same comment as above (should we just use queryStream?)
There was a problem hiding this comment.
No, then we’d need to duplicate the rest of db.query. db.queryStream doesn’t return a promise to an array of results; it returns a query stream response.
There was a problem hiding this comment.
Got it, thanks for the clarification!
libbey-observable
left a comment
There was a problem hiding this comment.
Did some basic testing locally after pairing on the review with @mkfreeman. Looks good!
domoritz
left a comment
There was a problem hiding this comment.
This is great. Just added some comments for more flexible imports.
| } | ||
| { | ||
| const package = await resolve("apache-arrow@9"); | ||
| console.log(`export const arrow9 = dependency("${package.name}", "${package.version}", "+esm");`); |
There was a problem hiding this comment.
Arrow 10 is out so you can update to that.
There was a problem hiding this comment.
Going to stick with Arrow 9 until duckdb-wasm upgrades…
| async *readRows() { | ||
| try { | ||
| while (!batch.done) { | ||
| yield batch.value.toArray(); |
There was a problem hiding this comment.
Sweet. I think native arrow support throughout would be awesome since we reduce materializations.
| const table = arrow.tableFromJSON(array); | ||
| const buffer = arrow.tableToIPC(table); | ||
| const connection = await database.connect(); | ||
| try { |
There was a problem hiding this comment.
I think it would be good to pull out a method for inserting an arrow Table. This method would be used to insert data that is already in columnar form.
| await connection.query( | ||
| `CREATE VIEW '${name}' AS SELECT * FROM parquet_scan('${file.name}')` | ||
| ); | ||
| } else { |
There was a problem hiding this comment.
Add a case for file.name.endsWith(".arrow")
| return { | ||
| schema: schema.fields.map(({name, type}) => ({ | ||
| name, | ||
| type: getType(String(type)), |
There was a problem hiding this comment.
This is maybe more brittle than directly checking against Arrow types but probably okay for now. Can improve later.
There was a problem hiding this comment.
Yes, I can use the Apache Arrow type identifiers when looking at the schema for an Apache Arrow Table, but the result of a DESCRIBE doesn’t give me the numeric identifiers — I only get the string column_type. I want to use the same code for describeColumns, too. It’d be nice if DuckDB returned the numeric type identifier in addition to the string.
There was a problem hiding this comment.
Added in 990c240. And I was able to simplify the switch for DuckDB types since now we only need to handle the canonical types returned by DESCRIBE rather than the aliases. (At least, I hope that’s the case… it appears to be in my local testing.)
71fbda3 to
a849d1b
Compare
a849d1b to
77ccac0
Compare
This…
Arrowas apache-arrow@4 for backwards compatibility.Demo:
TODO Document and test the DuckDBClient implementation.