Apache converts GZIPed data into UTF-8 - bug or feature?

Hello,

Configuring a REVERSE PROXY, I try to *relocate* the "mountpoint"
URL; i.e. change the filepath, so that http://myhost/stage/myapp will
reach the backend server as http://backend/myapp.

This seems to me a fairly common task, as one often has an app server
running its app at its server root, while it needs to be published
under a specific path.

The doc says one can do it this way:

<Location /stage>
ProxyPass "http://backend:5970"
ProxyPassReverse "http://backend:5970"
</Location>

I found that this doesn't help me much, because it does not
relocate the URLs in the body of a document. To solve this,
I found that I should include "proxy_html_module"; following the
instructions in "extra/proxy-html.conf", I activated these features:

LoadFile /usr/local/lib/libxml2.so
LoadModule proxy_html_module libexec/apache24/mod_proxy_html.so
LoadModule xml2enc_module libexec/apache24/mod_xml2enc.so

and added this to my Location-Container:

ProxyHTMLEnable On
ProxyHTMLURLMap http://backend:5970/ /stage/
ProxyHTMLURLMap / /stage/

This nicely solved my problem, but now weird errors appeared, which
took me a night to hunt down. I finally figured the problem is
the xml2enc_module, which does very serious damage: When the backend
sends a CSS stylesheet file, it looks this way:

From Backend to Apache:

From Apache to Client:

We can see that the Content-Type was modified to mention "utf-8", and
the size has increased from 20 to 24 bytes.

Let's look at the content:

From Backend to Apache:

0x00f0: 1f8b 0800 e4ca b25c 0003 .......\..
0x0100: 0300 0000 0000 0000 0000 ..........

This is the correct 20-byte hexcode of a gzip'd file of length 0.

From Apache to Client:

0x0140: 1fc2 8b08 00c3 ......
0x0150: a4c3 8ac2 b25c 0003 0300 0000 0000 0000 .....\..........
0x0160: 0000 ..

This is obviously valid UTF-8 text.
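
The size jump from 20 to 24 bytes is what one would expect if the gzip stream were treated as ISO-8859-1 text and re-encoded as UTF-8: the four bytes at or above 0x80 (8b, e4, ca, b2) each become two bytes. The following is a minimal C sketch of that transformation, not the code mod_xml2enc actually runs; the names latin1_to_utf8 and gzip_empty are made up for illustration.

```c
#include <stddef.h>

/*
 * Re-encode a buffer as UTF-8 on the assumption that every input byte is
 * an ISO-8859-1 character. `out` must have room for 2 * n bytes.
 * Returns the number of bytes written.
 */
size_t latin1_to_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0x80) {
            /* ASCII passes through unchanged */
            out[o++] = in[i];
        } else {
            /* lead byte is 0xC2 or 0xC3, followed by one continuation byte */
            out[o++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[o++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    return o;
}

/* The 20-byte gzip stream from the backend-to-Apache dump. */
const unsigned char gzip_empty[20] = {
    0x1f, 0x8b, 0x08, 0x00, 0xe4, 0xca, 0xb2, 0x5c, 0x00, 0x03,
    0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
};
```

Calling latin1_to_utf8(gzip_empty, 20, out) with a 40-byte out buffer yields 24 bytes beginning 1f c2 8b 08 00 c3 a4 c3 8a c2 b2 5c, which matches the Apache-to-client dump.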

But no browser can make anything of this, because it cannot
be reverted to the original gzip data, which is not a charset,
it is binary!
What we get instead is a load error in the Web Developer, or,
if we try to load the CSS-file directly, it says:

Content Encoding Error
The page you are trying to view cannot be shown because it uses an invalid or unsupported form of compression.

(Not very helpful either, so it takes quite a while of searching
around if one is not specifically involved in Web technology
and does this just for fun.)

The easy workaround is to switch off that xml2enc_module. But then
there are these annoying warnings when starting the server:
[Sun Apr 14 09:24:06.153900 2019] [proxy_html:notice] [pid 48178] AH01425: I18n support in mod_proxy_html requires mod_xml2enc. Without it, non-ASCII characters in proxied pages are likely to display incorrectly.

(Uh, hm. It does *not* mention _bytes_ in _gzip_ data that come out
incorrectly _WITH_ it.)

Anyway, I think this is so bogus that bogus is no longer a word for it.
Why is this happening, and what is to blame?

~~~~~~~~~~~~~
Server says:
Version: Apache/2.4.39 (FreeBSD) PHP/7.2.17 mod_scgi/1.15 OpenSSL/1.0.2o-freebsd
Server Built: unknown
Server loaded APR Version: 1.6.5
Compiled with APR Version: 1.6.5
Server loaded APU Version: 1.6.1
Compiled with APU Version: 1.6.1
Module Magic Number: 20120211:84
[and lots more of such; in case any is of interest for this matter, just ask]

$ pkg which /usr/local/lib/libxml2.so
/usr/local/lib/libxml2.so was installed by package libxml2-2.9.8

$ uname -a
FreeBSD myhost 11.2-RELEASE-p9 FreeBSD 11.2-RELEASE-p9 #0 r343946M#C51:240: Thu Mar 28 03:44:30 UTC 2019 root@myhost:/usr/src/sys/i386/compile/E1R11V1 i386

Comments

[patch] Apache converts GZIPed data into UTF-8 - 2

By Peter at 04/15/2019 - 09:44

Oh, nobody has an answer to the issue?

Okay...

Investigating, it appears that mod_xml2enc indeed grabs everything it
can lay hands on, if only it is tagged as some 'text/whatever', and
"converts" it (assuming it were ISO8859-1), no matter the damage, and
giving a f*** damn on compressed data. :((

This is obvious from the code; it is also visible in the
debug log:

[proxy_http:trace3] [pid 52505] mod_proxy_http.c(1402): [client 192.168.97.18:28882] Status from backend: 200
[proxy_http:trace4] [pid 52505] mod_proxy_http.c(1052): [client 192.168.97.18:28882] Headers received from backend:
[proxy_http:trace4] [pid 52505] mod_proxy_http.c(1075): [client 192.168.97.18:28882] Last-Modified: Sun, 14 Apr 2019 05:53:26 GMT
[proxy_http:trace4] [pid 52505] mod_proxy_http.c(1075): [client 192.168.97.18:28882] Content-Type: text/css
[proxy_http:trace4] [pid 52505] mod_proxy_http.c(1075): [client 192.168.97.18:28882] Content-Encoding: gzip
[proxy_http:trace4] [pid 52505] mod_proxy_http.c(1075): [client 192.168.97.18:28882] Vary: Accept-Encoding
[proxy_http:trace4] [pid 52505] mod_proxy_http.c(1075): [client 192.168.97.18:28882] Content-Length: 6194
[proxy_http:trace3] [pid 52505] mod_proxy_http.c(1672): [client 192.168.97.18:28882] start body send
[xml2enc:debug] [pid 52505] mod_xml2enc.c(176): [client 192.168.97.18:28882] AH01430: Content-Type is text/css
[xml2enc:debug] [pid 52505] mod_xml2enc.c(250): [client 192.168.97.18:28882] AH01434: Charset ISO-8859-1 not supported by libxml2; trying apr_xlate
[xml2enc:debug] [pid 52505] mod_xml2enc.c(464): [client 192.168.97.18:28882] AH01439: xml2enc: consuming 6194 bytes from bucket
[xml2enc:debug] [pid 52505] mod_xml2enc.c(490): [client 192.168.97.18:28882] AH01441: xml2enc: converted 4049/6193 bytes
[xml2enc:debug] [pid 52505] mod_xml2enc.c(490): [client 192.168.97.18:28882] AH01441: xml2enc: converted 2145/3242 bytes
[proxy_html:trace1] [pid 52505] mod_proxy_html.c(832): [client 192.168.97.18:28882] Non-HTML content; not inserting proxy-html filter
[http:trace3] [pid 52505] http_filters.c(1125): [client 192.168.97.18:28882] Response sent with status 200, headers:
[http:trace5] [pid 52505] http_filters.c(1134): [client 192.168.97.18:28882] Date: Sun, 14 Apr 2019 16:07:20 GMT
[http:trace5] [pid 52505] http_filters.c(1137): [client 192.168.97.18:28882] Server: Apache/2.4.39 (FreeBSD)
[http:trace4] [pid 52505] http_filters.c(955): [client 192.168.97.18:28882] Last-Modified: Sun, 14 Apr 2019 05:53:26 GMT
[http:trace4] [pid 52505] http_filters.c(955): [client 192.168.97.18:28882] Content-Type: text/css;charset=utf-8
[http:trace4] [pid 52505] http_filters.c(955): [client 192.168.97.18:28882] Content-Encoding: gzip
[http:trace4] [pid 52505] http_filters.c(955): [client 192.168.97.18:28882] Vary: Accept-Encoding
[http:trace4] [pid 52505] http_filters.c(955): [client 192.168.97.18:28882] Keep-Alive: timeout=15, max=100
[http:trace4] [pid 52505] http_filters.c(955): [client 192.168.97.18:28882] Connection: Keep-Alive
[http:trace4] [pid 52505] http_filters.c(955): [client 192.168.97.18:28882] Transfer-Encoding: chunked

Then, depending on which filters are configured, this may or may not
happen. It may even be runtime dependent. I tried to put proxy_html
into a filter chain to get a more defined behaviour, but this is not
possible, it produces a configuration error with FilterProvider,
although the documentation says:
"Any content filter may be used as a provider to mod_filter;
no change to existing filter modules is required"
So this does not work, either.

Finally I decided to fix the code, as well as I can. (As stated before,
I have absolutely no idea about this stuff and its conventions, I just
need to make the thing workable.)

     if (!ctx || !f->r->content_type) {
@@ -324,6 +325,17 @@
         return ap_pass_brigade(f->next, bb) ;
     }
 
+    if((c_enc = apr_table_get(f->r->headers_out, "Content-Encoding")) &&
+       !strstr(c_enc, "identity") &&
+       !apr_table_get(f->r->notes, "X-PMc-was-here")) {
+        ap_log_rerror(APLOG_MARK, APLOG_DEBUG, 0, f->r, APLOGNO(66666)
+                      "Probable deflated content, standing down") ;
+        ap_remove_output_filter(f);
+        return ap_pass_brigade(f->next, bb) ;
+    } else {
+        apr_table_set(f->r->notes, "X-PMc-was-here", "1");
+    }
+
     if (ctx->bbsave == NULL) {
         ctx->bbsave = apr_brigade_create(f->r->pool,
                                          f->r->connection->bucket_alloc);

Re: [patch] Apache converts GZIPed data into UTF-8

By Nick Kew at 04/15/2019 - 12:21

Well I might have done, but I was out rehearsing and performing Bach,
not reading your email!

Heh.

Well, you've identified an issue, albeit in rather colourful language!
mod_proxy_html knows to remove itself from the chain when it sees non-HTML,
but mod_xml2enc doesn't.

Probably my fault.

At which point, you want the same reaction from xml2enc as from proxy_html.
i.e. remove itself and leave your contents untouched.

That looks to me like a problem with your libxml2.
But that's outside the scope of this discussion.

Did you misspell it? It's proxy-html (hyphen, not underscore).

Hmm. Your fix does the job for you, but shouldn't be necessary.

I'm thinking, mod_proxy_html does the right thing removing itself.
mod_xml2enc should do the same when inserted by mod_proxy_html.
That should be straightforward to fix. I'll take a look later today.

Thanks for the detailed analysis!

Re: [patch] Apache converts GZIPed data into UTF-8

By Peter at 04/15/2019 - 18:46

On Mon, Apr 15, 2019 at 05:21:27PM +0100, Nick Kew wrote:
! > Oh, nobody has an answer to the issue?
!
! Well I might have done, but I was out rehearsing and performing Bach,
! not reading your email!

Oh, You're perfectly welcome to do so!

In fact I was just hoping for *any* reply - I didn't have the hope to
actually reach somebody deeply involved. Your reply is highly
appreciated!!

! mod_proxy_html knows to remove itself from the chain when it sees non-HTML,
! but mod_xml2enc doesn't.

From my viewpoint, the problem seemed to be that xml2enc is always
pulled into the process-chain, no matter if one wants it or not, and
the only apparent way to avoid that being to not load the module
(and live with the warnings issued on server start).

! > [xml2enc:debug] [pid 52505] mod_xml2enc.c(176): [client 192.168.97.18:28882] AH01430: Content-Type is text/css
!
! At which point, you want the same reaction from xml2enc as from proxy_html.
! i.e. remove itself and leave your contents untouched.

Not really, but that would be a viable approach in the sense of
"do-the-least-unexpected".

No, I would indeed like to run the xml2enc on all kinds of text
(because that may ease my issue with the always-postponed character
coding cleanup on my 20+ years old machines); I just want it to run
where _I_ want it to run - and definitely not on compressed data.

! > [xml2enc:debug] [pid 52505] mod_xml2enc.c(250): [client 192.168.97.18:28882] AH01434: Charset ISO-8859-1 not supported by libxml2; trying apr_xlate
!
! That looks to me like a problem with your libxml2.
! But that's outside the scope of this discussion.

Hm. Another piece of software I never looked at...

! > Then, depending on which filters are configured, this may or may not
! > happen. It may even be runtime dependent. I tried to put proxy_html
! > into a filter chain to get a more defined behaviour, but this is not
! > possible, it produces a configuration error with FilterProvider,
!
! Did you misspell it? It's proxy-html (hyphen, not underscore).

Now that's a hint! Indeed, I probably missed that one - I tried
with and without underscore, upper and lowercase, but likely
not the hyphen... and I failed to find the place in the source
where that name is declared. (Now, knowing the spelling, it is
easy to find ;))

And indeed! That works like I had hoped for - with
"ProxyHTMLEnable Off" and properly steered from the FilterChain,
so I can suppress it on probably compressed objects.
But it seems proxy-html does not even invoke xml2enc when called
in the filter chain - so the whole issue vaporizes in beauty. ;)

Nevertheless, the average stupid user (like me) might likely start
with the most simple configuration, and might run into this, and
would have a hard time figuring what is actually wrong; so we should
do something about it, and spare them a night searching.

! > Finally I decided to fix the code, as well as I can. (As stated before,
! > I have absolutely no idea about this stuff and its conventions, I just
! > need to make the thing workable.)
!
! Hmm. Your fix does the job for you, but shouldn't be necessary.

No, it's just that I didn't get it running discretely from the
filter chain.

! I'm thinking, mod_proxy_html does the right thing removing itself.
! mod_xml2enc should do the same when inserted by mod_proxy_html.

Yepp, and leave the option to insert xml2enc explicitly for other
kinds of files, if one wants to do that! Agreed!

Whereas, in an ideal world, mod_proxy_html would not stand down, but
would fixup the URLs in the stylesheet-documents as well.
But then, most people are concerned about performance and use an
asset-server anyway and not get such documents from the backend
(while I am just using Rails as scriptable database-GUI that I can
reach from anywhere in the world, disregarding performance), so
public demand for this may be limited; and it can nicely be done with
substitute.

! Thanks for the detailed analysis!

Thanks for the (involuntary) invitation to look a bit deeper
into the internals of that apache beast. :))

cheerio,
PMc

Re: [patch] Apache converts GZIPed data into UTF-8

By Nick Kew at 04/15/2019 - 18:43

On Mon, 15 Apr 2019 17:21:27 +0100

OK, I've looked.

What I'd like to do - pass responsibility back to the module
that inserted the xml2enc filter - calls for a minor API
change, so isn't going to happen in 2.4.x. A variant on
that approach might work, but right now I don't see anything
better than replicating mod_proxy_html's logic in mod_xml2enc
to deal with the situation where they're interacting.
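
As a rough illustration of what "replicating mod_proxy_html's logic" could amount to, here is a small C sketch of a content-type test that a filter might use to decide whether to stand down. It is an assumption about the shape of such a check, not the code of either module; is_html_content_type and has_prefix_ci are hypothetical names, and the two accepted media types are this sketch's choice.

```c
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive test: does `s` begin with `prefix`? */
static int has_prefix_ci(const char *s, const char *prefix)
{
    while (*prefix != '\0') {
        /* a shorter `s` hits its NUL here and fails the comparison */
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/*
 * Return 1 if the Content-Type value looks like HTML or XHTML, else 0.
 * Parameters such as ";charset=..." after the media type are ignored.
 */
int is_html_content_type(const char *content_type)
{
    if (content_type == NULL)
        return 0;
    return has_prefix_ci(content_type, "text/html") ||
           has_prefix_ci(content_type, "application/xhtml+xml");
}
```

With a test like this, a response of type text/css would make the filter remove itself, which is the behaviour mod_proxy_html shows in the debug log for the stylesheet.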

Your check on content-encoding also looks good.
Except that unless I'm missing something, your use of f->r->notes
is unnecessary: ap_remove_output_filter means we don't revisit
that code!

Re: [patch] Apache converts GZIPed data into UTF-8

By Peter at 04/15/2019 - 21:51

Hi Nick,

! OK, I've looked.

me too. ;)

! What I'd like to do - pass responsibility back to the module
! that inserted the xml2enc filter - calls for a minor API
! change, so isn't going to happen in 2.4.x. A variant on
! that approach might work, but right now I don't see anything
! better than replicating mod_proxy_html's logic in mod_xml2enc
! to deal with the situation where they're interacting.
!
! Your check on content-encoding also looks good.
! Except that unless I'm missing something, your use of f->r->notes
! is unnecessary: ap_remove_output_filter means we don't revisit
! that code!

Yes, it would be unnecessary if my code were in the proper place,
which it currently is not.
Given a chain INFLATE;XML2ENC;DEFLATE it looks like this:

[filter:trace4] [pid 77874] util_expr_eval.c(858): [client 192.168.97.18:65401] Evaluation of expression from /usr/local/etc/apache24/extra/httpd-ruby.conf:126 gave: 1
[filter:trace2] [pid 77874] mod_filter.c(159): [client 192.168.97.18:65401] Expression condition for 'inflate' matched
[filter:trace4] [pid 77874] util_expr_eval.c(858): [client 192.168.97.18:65401] Evaluation of expression from /usr/local/etc/apache24/extra/httpd-ruby.conf:127 gave: 1
[filter:trace2] [pid 77874] mod_filter.c(159): [client 192.168.97.18:65401] Expression condition for 'xml2enc' matched
[xml2enc:debug] [pid 77874] mod_xml2enc.c(176): [client 192.168.97.18:65401] AH01430: Content-Type is text/css
[xml2enc:debug] [pid 77874] mod_xml2enc.c(250): [client 192.168.97.18:65401] AH01434: Charset ISO-8859-1 not supported by libxml2; trying apr_xlate
[xml2enc:debug] [pid 77874] mod_xml2enc.c(476): [client 192.168.97.18:65401] AH01439: xml2enc: consuming 8096 bytes from bucket
[xml2enc:debug] [pid 77874] mod_xml2enc.c(502): [client 192.168.97.18:65401] AH01441: xml2enc: converted 8096/8096 bytes
[filter:trace4] [pid 77874] util_expr_eval.c(858): [client 192.168.97.18:65401] Evaluation of expression from /usr/local/etc/apache24/extra/httpd-ruby.conf:130 gave: 1
[filter:trace2] [pid 77874] mod_filter.c(159): [client 192.168.97.18:65401] Expression condition for 'deflate' matched
[xml2enc:debug] [pid 77874] mod_xml2enc.c(476): [client 192.168.97.18:65401] AH01439: xml2enc: consuming 8096 bytes from bucket
[xml2enc:debug] [pid 77874] mod_xml2enc.c(502): [client 192.168.97.18:65401] AH01441: xml2enc: converted 8096/8096 bytes
[xml2enc:debug] [pid 77874] mod_xml2enc.c(476): [client 192.168.97.18:65401] AH01439: xml2enc: consuming 8096 bytes from bucket
[xml2enc:debug] [pid 77874] mod_xml2enc.c(502): [client 192.168.97.18:65401] AH01441: xml2enc: converted 8096/8096 bytes
[deflate:debug] [pid 77874] mod_deflate.c(1622): [client 192.168.97.18:65401] AH01398: Zlib: Inflated 6176 to 28247 : URL /fin-stage/assets/application-3a5821b5be536e0108d5934c96815299001dfa3c1ddff9f39676a3a3126d8190.css
[xml2enc:debug] [pid 77874] mod_xml2enc.c(476): [client 192.168.97.18:65401] AH01439: xml2enc: consuming 3959 bytes from bucket
[xml2enc:debug] [pid 77874] mod_xml2enc.c(502): [client 192.168.97.18:65401] AH01441: xml2enc: converted 3959/3959 bytes
[deflate:debug] [pid 77874] mod_deflate.c(854): [client 192.168.97.18:65401] AH01384: Zlib: Compressed 28247 to 6226 : URL /fin-stage/assets/application-3a5821b5be536e0108d5934c96815299001dfa3c1ddff9f39676a3a3126d8190.css

Currently my snippet is run for each of these chunks of data
(which is not a good idea, but I didn't hope to be able to understand
the code in its fullness and find a better place). So, with the
DEFLATE walking behind, when it comes to the second chunk, the
DEFLATE will already have put the "gzip" header back in, and so
I watched xml2enc quit in the midst of the document.
That's why I put that in.
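
The interaction described above can be modelled with a toy simulation. This is only a sketch of the reasoning, with no Apache API involved: toy_request, convert_chunk, compress_chunk and chunks_converted are invented names, and the latch field stands in for the "X-PMc-was-here" note.

```c
#include <stddef.h>

/* Toy stand-in for per-request state; not an Apache structure. */
struct toy_request {
    const char *content_encoding; /* NULL while the body is still plain */
    int latch;                    /* plays the role of the request note */
};

/* One call of the converting filter: 1 = converts, 0 = stands down. */
static int convert_chunk(struct toy_request *r, int use_latch)
{
    if (r->content_encoding != NULL && !(use_latch && r->latch))
        return 0;
    r->latch = 1; /* remember that the first chunk arrived uncompressed */
    return 1;
}

/* Downstream compressor: labels the response once it has seen data. */
static void compress_chunk(struct toy_request *r)
{
    r->content_encoding = "gzip";
}

/* Push nchunks through both filters; count the chunks that got converted. */
int chunks_converted(int nchunks, int use_latch)
{
    struct toy_request r = { NULL, 0 };
    int converted = 0;
    for (int i = 0; i < nchunks; i++) {
        if (!convert_chunk(&r, use_latch))
            break; /* filter stood down; later chunks pass unconverted */
        converted++;
        compress_chunk(&r);
    }
    return converted;
}
```

In this model chunks_converted(4, 0) gives 1, because the header check trips on the second chunk, while chunks_converted(4, 1) gives 4.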

Another minor flaw is that the test for "Content-Encoding: identity"
(btw: does anybody use that?) is probably not case-insensitive.
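
A case-insensitive variant of that test might look like the sketch below. It also compares whole list items rather than substrings, since strstr() would accept any value that merely contains "identity". This is an illustration in plain C, not a proposed patch; has_real_content_coding and token_equals are hypothetical names.

```c
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive comparison of the n bytes at `s` with `word`. */
static int token_equals(const char *s, size_t n, const char *word)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (word[i] == '\0' ||
            tolower((unsigned char)s[i]) != tolower((unsigned char)word[i]))
            return 0;
    }
    return word[i] == '\0';
}

/*
 * Return 1 if a Content-Encoding value names a real content coding,
 * 0 if it is absent, empty, or lists only "identity" in any letter case.
 */
int has_real_content_coding(const char *value)
{
    const char *p = value;
    if (p == NULL)
        return 0;
    while (*p != '\0') {
        /* skip commas and optional whitespace between list items */
        while (*p == ',' || *p == ' ' || *p == '\t')
            p++;
        const char *start = p;
        while (*p != '\0' && *p != ',' && *p != ' ' && *p != '\t')
            p++;
        if (p > start && !token_equals(start, (size_t)(p - start), "identity"))
            return 1;
    }
    return 0;
}
```

Here "Identity" and "IDENTITY" count as no encoding, while "gzip" and "identity, gzip" are reported as encoded.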

And then I was thinking about a different and probably better approach:
if we can check the first few bytes of the actual document
beforehand, we can test these against the signatures of the usual
compression-algorithms (in the same way as the "file" command does it
on Unix). This seems more safe than relying on header information.

Because, I don't see a reason why an HTML document might not also be
compressed - and then it wouldn't help to just stop processing CSS
documents.
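
The signature idea could be sketched as follows, in the spirit of the "file" command. The function name sniff_compression and the choice of formats are assumptions of this example. Raw deflate and brotli streams carry no magic number, so a check like this can only complement the Content-Encoding header, not replace it.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return a short name for the compression format whose signature starts
 * the buffer, or NULL if none of the listed signatures matches.
 */
const char *sniff_compression(const unsigned char *buf, size_t len)
{
    static const struct {
        const char *name;
        size_t len;
        unsigned char magic[6];
    } sigs[] = {
        { "gzip",  2, { 0x1f, 0x8b } },
        { "bzip2", 3, { 'B', 'Z', 'h' } },
        { "xz",    6, { 0xfd, '7', 'z', 'X', 'Z', 0x00 } },
        { "zstd",  4, { 0x28, 0xb5, 0x2f, 0xfd } },
    };
    for (size_t i = 0; i < sizeof sigs / sizeof sigs[0]; i++) {
        /* only compare when the buffer is long enough for the signature */
        if (len >= sigs[i].len &&
            memcmp(buf, sigs[i].magic, sigs[i].len) == 0)
            return sigs[i].name;
    }
    return NULL;
}
```

Applied to the first bytes of the stylesheet body from the backend (1f 8b 08 00 ...), this returns "gzip"; for ordinary CSS text it returns NULL. It remains a heuristic, since a text document could in principle begin with the same bytes as one of the signatures.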

Btw, concerning this message, I had a look at that one, too:
AH01434: Charset ISO-8859-1 not supported by libxml2; trying apr_xlate

It seems to me that this message is reached just because the document
is compressed (and libxml2 can obviously not find a charset in
that); only the message text seems misleading.
Maybe a conservative approach would be to just stop at that point
and give up - because, compression might not be the only issue here;
people might get the idea to use some end-to-end encryption for
certain documents, and that would also appear as binary data that we
must not tamper with...
(just thinking along)

cheerio,
PMc