perf(az): optimize non-recursive _list_dir#447
perf(az): optimize non-recursive _list_dir#447M0dEx wants to merge 4 commits intodrivendataorg:masterfrom
_list_dir#447Conversation
_list_dir
|
Thanks, @M0dEx. Could you also run before and after perf numbers for Azure? You'll need to run this command but with You should get a table you can paste into the PR like this: |
dfdbd8b to
1af198a
Compare
Added performance tests before and after. |
1af198a to
8f27b82
Compare
|
Looks like the performance generally is the same, with |
8f27b82 to
a230114
Compare
|
Pushed a fix for the failing tests (apparently caused by the |
a230114 to
2783e38
Compare
|
Removed the |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #447 +/- ##
========================================
- Coverage 93.7% 93.3% -0.5%
========================================
Files 23 23
Lines 1654 1659 +5
========================================
- Hits 1551 1548 -3
- Misses 103 111 +8
|
pjbull
left a comment
There was a problem hiding this comment.
I've got a few questions about the approach here. Preferably we can use a more explicitly supported version of things.
| is_folder = ( | ||
| metadata.content_settings.content_type is None | ||
| and metadata.content_settings.content_md5 is None | ||
| ) |
There was a problem hiding this comment.
Can you point to a more specific reference? I don't see the link you sent supporting the claim that this is definitive for folders.
It does seem like x-ms-resource-type is useful to answer the question directly, but only for certain configurations so could be included as an optimization. It may be the case that all the scenarios where those vars are none are also ones where x-ms-resource-type is available.
There was a problem hiding this comment.
I agree the reference is sort of vague. I was mainly going of off these two statements:
Content-Type: The content type that's specified for the blob. If no content type is specified, the default content type is application/octet-stream.
Content-MD5: If the Content-MD5 header has been set for the blob, this response header is returned so that the client can check for message content integrity.
In version 2012-02-12 and later, Put Blob sets a block blob’s MD5 value even when the Put Blob request doesn’t include an MD5 header.
Using this, coupled with our observations from real-life usage, we concluded that these parameters are always None for folders and never None for blobs, even if they are empty.
There was a problem hiding this comment.
Does your configuration have the x-ms-resource-type header for these folder blobs?
There was a problem hiding this comment.
That header does not seem to be returned within BlobProperties, nor can I see it in the in-browser Azure storage explorer.
|
|
||
| for blob in blobs: | ||
| # walk_blobs returns folders with a trailing slash | ||
| blob_path = blob.name.rstrip("/") |
There was a problem hiding this comment.
Is it problematic to keep the trailing slashes?
There was a problem hiding this comment.
I did it to be consistent with the current behaviour, but we can simply not remove the trailing slashes.
|
Incorporated into #453, thanks! |
As described in #446, the current implementation of
_list_dirin theAzureBlobClientcan be painfully slow for directories with a high level of nesting, even if the result is supposed to be non-recursive.This PR significantly improves the performance by using a newer portion of the Azure Blob Storage API, more specifically the
BlobServiceClient.walk_blobsfunction, which allows for much faster non-recursive iteration over the contents of a directory. Additionally, it is now possible to recursively iterate over the root of an account - over all containers.Furthermore, the detection of whether a blob is a file or a directory has been improved, using
content_typeandcontent_md5properties as described by the Azure Blob API.Closes #446.
Closes #444.
Performance before:
Performance after: