Ensure PyPI simple index endpoint output is valid HTML5
🌱 Context
Running pip install against GitLab's Python package registry now raises a warning that the index page is not HTML5 compliant.
Upstream issue:
https://github.com/pypa/pip/issues/10825
Warning reported for a self-hosted instance (up-to-date):
"DEPRECATION: The HTML index page being used (https://gitlab.kitware.com/api/v4/projects/13/packages/pypi/simple/vtk/) is not a proper HTML 5 document. This is in violation of PEP 503 which requires these pages to be well-formed HTML 5 documents. Please reach out to the owners of this index page, and ask them to update this index page to a valid HTML 5 document. pip 22.2 will enforce this behaviour change. Discussion can be found at https://github.com/pypa/pip/issues/10825"
💥 How to Replicate The Issue
Prerequisites:
- Newer versions of pip do not show the warning. To replicate the issue, downgrade to pip version 22.0.2:
pip install pip==22.0.2
- Set up authentication
- Publish a package
Install the package you published above. With pip 22.0.2 or earlier, you'll get the message:
"DEPRECATION: The HTML index page being used (http://gdk.test:3000/api/v4/projects/7/packages/pypi/simple/mypypipackage/) is not a proper HTML 5 document. This is in violation of PEP 503 which requires these pages to be well-formed HTML 5 documents. Please reach out to the owners of this index page, and ask them to update this index page to a valid HTML 5 document. pip 22.2 will enforce this behaviour change. Discussion can be found at https://github.com/pypa/pip/issues/10825"
🚑 Solution
The HTML template used to generate the response is here
To fix the issue, update that template:
- Remove the leading spaces before <!DOCTYPE html>. The check is case-insensitive (https://github.com/pypa/pip/pull/10844) but does not strip leading spaces or tabs.
Because we're here anyway, let's do some housekeeping:
1. This is not needed to pass the pip check, but it can help ensure HTML5 validity (according to the W3C validator):
* Add a language declaration: <html lang="en">
* Add a charset declaration: <meta charset="utf-8">
2. We're digging into pip's Python source anyway, so we might as well take the extra time to ensure the HTML we generate will be parsed correctly by pip. On top of ensuring we pass the pip check, let's do some sanity checks:
- Save the generated HTML of the index endpoint.
- Pass the HTML to the Python standard library's html.parser (https://docs.python.org/3/library/html.parser.html) and verify that it is parsed correctly.
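The sanity check described above could look roughly like the sketch below. It is only an illustration: the IndexPageChecker class, the check_index_page helper, and the SAMPLE_INDEX document (including its URL and hash) are hypothetical stand-ins for a saved response from the index endpoint, not GitLab's or pip's actual code.

```python
from html.parser import HTMLParser


class IndexPageChecker(HTMLParser):
    """Records the doctype declaration and every anchor href in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.doctype = None
        self.links = []

    def handle_decl(self, decl):
        # For "<!DOCTYPE html>" the parser passes the text "DOCTYPE html".
        self.doctype = decl

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def check_index_page(html):
    """Parse the page and return (doctype, list of hrefs)."""
    checker = IndexPageChecker()
    checker.feed(html)
    checker.close()
    return checker.doctype, checker.links


# Hypothetical saved response; the file name, host, and hash are made up.
SAMPLE_INDEX = """<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Links for mypypipackage</title>
  </head>
  <body>
    <h1>Links for mypypipackage</h1>
    <a href="https://gitlab.example.com/files/mypypipackage-0.0.1.tar.gz#sha256=abc123">mypypipackage-0.0.1.tar.gz</a><br>
  </body>
</html>
"""

doctype, links = check_index_page(SAMPLE_INDEX)
print(doctype)
print(links)
```

In practice SAMPLE_INDEX would be replaced by the body saved from the real endpoint, and the collected links compared against the package files that were published.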
Additional Notes
UPDATE: The crossed-out "Let's do some housekeeping" above seemed like a good idea at the time but, on reflection, it now seems unnecessary:
- The small whitespace fix is enough: we just need to satisfy the pip check.
- We already have specs that verify that we're generating a correct HTML document with the correct links.
This is the check done by pip:
if actual_start.decode(encoding).lower() != "<!doctype html>":
- It only checks whether the document starts with the doctype declaration (case-insensitive).
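To see why leading whitespace trips this check, here is a minimal sketch built around the comparison quoted above. The passes_doctype_check function is hypothetical, and it assumes that actual_start is simply the first bytes of the response body, as long as the expected doctype string; pip's real surrounding code may differ.

```python
EXPECTED_START = "<!doctype html>"


def passes_doctype_check(content: bytes, encoding: str = "utf-8") -> bool:
    # Assumption: actual_start is the leading slice of the response body,
    # the same length as the expected doctype string.
    actual_start = content[: len(EXPECTED_START)]
    # Same comparison as the quoted pip line, inverted to return True on success.
    return actual_start.decode(encoding).lower() == EXPECTED_START


print(passes_doctype_check(b"<!DOCTYPE html>\n<html>"))    # True
print(passes_doctype_check(b"<!doctype HTML>\n<html>"))    # True
print(passes_doctype_check(b"  <!DOCTYPE html>\n<html>"))  # False
```

The third call fails because the two leading spaces shift the doctype out of the compared slice, which is the situation the whitespace fix addresses.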
The current version of pip (24.2) no longer does the HTML5 check (code). It just assumes that HTMLParser will parse it correctly.
This is OK. Our spec, spec/presenters/packages/pypi/simple_index_presenter_spec.rb, already tests that we generate a valid HTML document with the correct links.