DevHeads.net

Apache converts GZIPed data into UTF-8 - bug or feature?

Hello,

While configuring a REVERSE PROXY, I am trying to *relocate* the
"mountpoint" URL, i.e. change the path so that
http://myhost/stage/myapp will reach the backend server as
http://backend/myapp.

This seems to me a fairly common task, as one often has an app server
running its app at its server root, while the app needs to be
published under a specific path.

The doc says one can do it this way:

<Location /stage>
ProxyPass "http://backend:5970"
ProxyPassReverse "http://backend:5970"
</Location>

I found that this doesn't help me much, because it does not
rewrite the URLs in the body of a document. To solve this,
I decided to include "proxy_html_module". Following the instructions
in "extra/proxy-html.conf", I activated these features:

LoadFile /usr/local/lib/libxml2.so
LoadModule proxy_html_module libexec/apache24/mod_proxy_html.so
LoadModule xml2enc_module libexec/apache24/mod_xml2enc.so

and added this to my Location-Container:

ProxyHTMLEnable On
ProxyHTMLURLMap http://backend:5970/ /stage/
ProxyHTMLURLMap / /stage/

This nicely solved my problem, but now weird errors appeared, which
took me a night to hunt down. I finally figured out that the problem
is the xml2enc_module, which does very serious damage. When the
backend sends a CSS stylesheet file, it looks this way:

From Backend to Apache:

From Apache to Client:

We can see that the Content-Type was modified to mention "utf-8", and
the size has increased from 20 to 24 bytes.

Let's look at the content:

From Backend to Apache:

0x00f0: 1f8b 0800 e4ca b25c 0003 .......\..
0x0100: 0300 0000 0000 0000 0000 ..........

This is the correct 20-byte hexcode of a gzip'd file of length 0.
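
A quick cross-check of that claim, as a minimal Python sketch. The
variable names are illustrative only; the hex string is copied from
the dump above.

```python
import gzip

# The 20 bytes the backend sent, copied from the dump above.
backend_bytes = bytes.fromhex(
    "1f8b0800e4cab25c0003"
    "03000000000000000000"
)

# Decompress with the standard library; a valid gzip stream of an
# empty file should yield an empty byte string.
payload = gzip.decompress(backend_bytes)
print(len(backend_bytes), repr(payload))
```

The decoder accepts the stream and returns an empty payload, so the
backend's response body is a well-formed gzip file of length 0.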

From Apache to Client:

0x0140: 1fc2 8b08 00c3 ......
0x0150: a4c3 8ac2 b25c 0003 0300 0000 0000 0000 .....\..........
0x0160: 0000 ..

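This is obviously valid UTF-8 text: each of the four bytes above 0x7f
(8b, e4, ca, b2) has been replaced by a two-byte sequence, which
accounts for the four extra bytes.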
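
The client-side bytes are consistent with the gzip body having been
read as ISO-8859-1 text and re-encoded as UTF-8. That the conversion
starts from ISO-8859-1 is an assumption based only on the byte
pattern in the two dumps, not on anything the module reports. A
minimal Python sketch (illustrative variable names) that reproduces
the damage:

```python
import gzip

# Bytes copied from the two dumps above.
backend_bytes = bytes.fromhex(
    "1f8b0800e4cab25c0003" "03000000000000000000"
)
client_bytes = bytes.fromhex(
    "1fc28b0800c3a4c38ac2b25c0003" "03000000000000000000"
)

# Assumption: every byte is taken as one ISO-8859-1 character and the
# resulting text is then encoded as UTF-8.
reencoded = backend_bytes.decode("latin-1").encode("utf-8")

print(len(backend_bytes), len(reencoded))
print(reencoded == client_bytes)

# The re-encoded body no longer starts with the gzip magic 1f 8b,
# so a gzip decoder rejects it.
try:
    gzip.decompress(reencoded)
    still_gzip = True
except OSError:
    still_gzip = False
print(still_gzip)
```

Under that assumption the sketch turns the 20 backend bytes into
exactly the 24 bytes seen on the client side, and the result is no
longer accepted as gzip.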

But no browser can make anything of this, because it expects gzip
data and will not undo the conversion. Gzip is not text in some
charset, it is binary!
What we get instead is a load error in the Web Developer tools, or,
if we try to load the CSS file directly, it says:

Content Encoding Error
The page you are trying to view cannot be shown because it uses an invalid or unsupported form of compression.

(Not very helpful either, so it takes quite a while of searching
around if one is not specifically involved in Web technology
and does this just for fun.)

The easy workaround is to switch off that xml2enc_module. But then
there are these annoying warnings when starting the server:
[Sun Apr 14 09:24:06.153900 2019] [proxy_html:notice] [pid 48178] AH01425: I18n support in mod_proxy_html requires mod_xml2enc. Without it, non-ASCII characters in proxied pages are likely to display incorrectly.

(Uh, hm. It does *not* mention the _bytes_ in _gzip_ data that come
out incorrectly _WITH_ it.)

Anyway, I think this is so bogus that bogus is no longer a word for it.
Why is this happening, and what is to blame?

~~~~~~~~~~~~~
Server says:
Version: Apache/2.4.39 (FreeBSD) PHP/7.2.17 mod_scgi/1.15 OpenSSL/1.0.2o-freebsd
Server Built: unknown
Server loaded APR Version: 1.6.5
Compiled with APR Version: 1.6.5
Server loaded APU Version: 1.6.1
Compiled with APU Version: 1.6.1
Module Magic Number: 20120211:84
[and lots more of such; in case any is of interest for this matter, just ask]

$ pkg which /usr/local/lib/libxml2.so
/usr/local/lib/libxml2.so was installed by package libxml2-2.9.8

$ uname -a
FreeBSD myhost 11.2-RELEASE-p9 FreeBSD 11.2-RELEASE-p9 #0 r343946M#C51:240: Thu Mar 28 03:44:30 UTC 2019 root@myhost:/usr/src/sys/i386/compile/E1R11V1 i386