gh-150821: Skip URL parsing in mimetypes.guess_type() for file paths #150828
Open
gaborbernat wants to merge 1 commit into
…paths guess_type() parsed every argument as a URL before checking the extension, even for plain file paths that have no scheme. Detect the no-scheme case and go straight to extension lookup, avoiding urlparse() and its lazy import. Real URLs keep the full parsing path; results are unchanged.
d35cf04 to
175764e
`mimetypes.guess_type()` accepts either a URL or a filesystem path, so it parses its argument as a URL with `urllib.parse.urlparse()` before looking at the extension. The common argument is a plain file path, which has no URL scheme to find, so the parse, and the `urllib.parse` import it triggers, is spent on nothing. Guessing content types from file names is everywhere: static-file servers, upload handlers, archive and build tools deciding how to treat each file as they walk a tree of thousands.

A URL scheme requires a `:`, so a path without one cannot be a URL. This detects that case and goes straight to extension lookup, skipping `urlparse()` and its lazy import. Real URLs, and the rare path that contains a `:`, still take the full parsing path, and results are unchanged for both.

Guessing types for 15 real file names sampled from the top-1000 corpus improves from 23.4 µs to 11.0 µs, 112% faster.
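The no-scheme check can be illustrated with a minimal sketch. This is not the patched `Lib/mimetypes.py` code; `may_have_scheme` and `split_scheme` are hypothetical helpers that only show why a string without a `:` can skip `urlparse()` without changing the detected scheme.

```python
def may_have_scheme(name: str) -> bool:
    # Hypothetical helper: a URL scheme is always terminated by ":",
    # so a string with no colon cannot carry one.
    return ":" in name


def split_scheme(name: str) -> str:
    """Return the URL scheme of name, or "" when there is none."""
    if not may_have_scheme(name):
        # Fast path: plain file path, no parsing and no import needed.
        return ""
    # Slow path: import deferred until a colon is actually seen.
    from urllib.parse import urlparse
    return urlparse(name).scheme


# Illustrative inputs (made up for this sketch, not the PR's corpus).
samples = [
    "README.md",
    "static/css/site.css",
    "archive.tar.gz",
    "https://example.com/index.html",
    "C:\\data\\report.pdf",  # a path that contains ":" takes the slow path
]

from urllib.parse import urlparse

for name in samples:
    # The shortcut must agree with full parsing for every input.
    assert split_scheme(name) == urlparse(name).scheme
```

The loop at the end checks that the shortcut and the full parse agree on the scheme for both plain paths and real URLs.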
Benchmark (pyperf)
Run base vs patched by swapping `Lib/mimetypes.py` on the same interpreter. The names are real file names sampled from the top-1000 corpus.

Resolves #150821.
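A rough way to take a measurement of this kind is sketched below. It assumes the stdlib `timeit` module rather than pyperf, and `NAMES` is a made-up list standing in for the corpus sample, so the numbers it prints are not comparable to the ones quoted in the description.

```python
import mimetypes
import timeit

# Hypothetical file names; the PR's actual corpus sample is not reproduced here.
NAMES = [
    "README.md", "setup.py", "pyproject.toml", "index.html", "style.css",
    "app.js", "logo.png", "photo.jpg", "data.json", "notes.txt",
    "archive.tar.gz", "report.pdf", "icon.svg", "feed.xml", "table.csv",
]


def guess_all():
    # One pass: guess the type of every name in the list.
    for name in NAMES:
        mimetypes.guess_type(name)


mimetypes.init()  # load the type maps up front so that cost is not timed
guess_all()       # warm-up pass

# Five repeats of 1000 passes each; report the best repeat.
runs = timeit.repeat(guess_all, number=1000, repeat=5)
best_us = min(runs) / 1000 * 1e6
print(f"{best_us:.1f} µs per pass over {len(NAMES)} names")
```

Running the same script once against the base `Lib/mimetypes.py` and once against the patched file, on the same interpreter, gives the two numbers to compare.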