glob and rglob can be slow for large directories or lots of files #274

@Gilthans

Description

Hello!
While gleefully using cloudpathlib, I needed a recursive iteration of files in a directory. This directory is large (6238 files), so my first approach - a recursive iterdir() + is_file() - took waaaay too long (likely due to #176).
I remembered that glob made better use of the cloud list calls, so I tried list(p.rglob("*")). In my directory, that took 17m:34s.
I then tried to 'cheat' and call [f for f, is_dir in p.client._list_dir(p, recursive=True) if not is_dir]. It took 1.435s.

I looked at the glob logic, but I still can't understand the cause of the discrepancy (I'm using Google Cloud). That may warrant a separate issue.

In the meantime, I wonder if it might be a good idea to make _list_dir a public function as a workaround.
Another option is to add recursive and/or files_only keywords to iterdir. This deviates from the pathlib API, but since these are added keywords, it might be OK?

I'm suggesting these options because, even though solving #176 would probably address most of the issues, these solutions are much simpler.
I'd of course be happy to send a PR.
WDYT?
