Fetching Potentially Compressed Webpages That Might Not be UTF-8 Encoded in Rust

TL;DR This write-up explains how I built my first Rust crate to fetch compressed, non-UTF-8-encoded webpages. If you want to get straight to the code, grab the final fetch crate on crates.io or view the source on GitHub.

For a side project I needed to download some arbitrary webpages (HTML or plain text only). In the absence of a target shipping date, I decided to use this opportunity to write the software in Rust, a language I had been observing with interest for a long time but never actually used.

Preparation

Since there is no HTTP client in the Rust standard library, I picked the currently most popular HTTP library on crates.io: hyper.

The basic client example looked fine initially but quickly turned out to be insufficient when tested against a variety of sites. Many sites send compressed or non-UTF-8-encoded data. While the former could often be avoided by setting the Accept-Encoding header to identity, the latter required more changes to the program, since strings in Rust are always UTF-8 encoded and the data thus needed to be converted.
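Not every server honors Accept-Encoding: identity, so a cheap sanity check (my own sketch here, not something the final crate does) is to look at the first bytes of the body: gzip streams always begin with the magic bytes 0x1f 0x8b.

```rust
/// Returns true if `body` looks like a gzip stream.
/// Gzip data always begins with the two magic bytes 0x1f, 0x8b.
fn looks_gzipped(body: &[u8]) -> bool {
    body.starts_with(&[0x1f, 0x8b])
}

fn main() {
    // First bytes of a real gzip stream vs. plain HTML.
    assert!(looks_gzipped(&[0x1f, 0x8b, 0x08, 0x00]));
    assert!(!looks_gzipped(b"<!DOCTYPE html>"));
}
```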

So I expanded my list of dependencies to the following set of crates:

- hyper (HTTP client)
- flate2 (decompression)
- encoding (charset conversion)

Implementation

First, let's look at how easy the current implementation makes it to fetch a page's body as a String:

let body = fetch::fetch_page("https://www.rust-lang.org/en-US/");

The final project used for the prototype and this blog post is available on GitHub at tp/fetch-rs. It has also been published as the fetch crate on crates.io.

As alluded to above, I started out with a simple hyper client that worked fine for the first few manual fetches. I could just read the response into a string using:

let mut body_buffer = String::new();

response.read_to_string(&mut body_buffer);

But this would fail when the page was not UTF-8 encoded, and would return garbled output for compressed content.
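The UTF-8 failure is easy to reproduce with the standard library alone: read_to_string validates the bytes and returns an InvalidData error for anything that is not valid UTF-8. Here a Cursor stands in for the hyper response (a minimal reproduction, not the crate's code):

```rust
use std::io::{self, Cursor, Read};

// Simulate reading an HTTP response body with `read_to_string`,
// using a `Cursor` as a stand-in for the hyper response.
fn read_body_as_string(bytes: Vec<u8>) -> io::Result<String> {
    let mut body_buffer = String::new();
    Cursor::new(bytes).read_to_string(&mut body_buffer)?;
    Ok(body_buffer)
}

fn main() {
    // Valid UTF-8 works fine...
    assert_eq!(read_body_as_string(b"hello".to_vec()).unwrap(), "hello");
    // ...but a lone 0xE9 ("é" in ISO-8859-1) fails with ErrorKind::InvalidData.
    assert!(read_body_as_string(vec![0xE9]).is_err());
}
```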

So I went looking for an encoding and a decompression library, which led me to encoding and flate2.

Now things became a little more involved, since I had to explore the two new crates. This had to be done differently from how I would normally approach it, since full autocompletion for crates is not yet available. But thanks to the pervasive use of rustdoc, there is a good amount of documentation for every crate I encountered.

So after a while of reading the API documentation, I was able to unzip the response body

let mut unzipped_body_buffer = Vec::new();
let mut d = GzDecoder::new(body_buffer.as_slice());
d.read_to_end(&mut unzipped_body_buffer);

and also convert the given charset to UTF-8:

let decoder = encoding_from_whatwg_label(charset).unwrap();
return decoder.decode(&unzipped_body_buffer, DecoderTrap::Strict)

[Please note that the examples above are abbreviated excerpts from the actual source code and do not contain proper error handling; see the source for how errors are currently handled.]
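For the common special case of ISO-8859-1 (Latin-1), the conversion could even be done without the encoding crate, since every Latin-1 byte maps to the Unicode code point with the same numeric value. A std-only sketch of that idea (not what the crate actually uses; it relies on encoding_from_whatwg_label to cover arbitrary charsets):

```rust
/// Decode ISO-8859-1 (Latin-1) bytes into a Rust String.
/// Every Latin-1 byte value maps 1:1 to the Unicode code point
/// with the same number, so `u8 as char` is a correct conversion here.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

fn main() {
    // 0xE9 is "é" in Latin-1.
    assert_eq!(latin1_to_string(&[b'c', b'a', b'f', 0xE9]), "café");
}
```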

Once my spot checks succeeded, I wanted to try Rust's testing facilities and ship some tests with the library. As it turns out, testing your code is super easy: just annotate test functions in your library with #[test] and run cargo test.

Currently I just check whether the fetch returns a success value, but as a next step I want to set up a test server and compare the whole response body to a reference file.

#[test]
fn fetch_deflate_compressed_page() {
    fetch_page("http://httpbin.org/deflate").expect("Fetch to succeed");
}

After all tests passed[^1], I was satisfied with the initial scope of the library and went on to publish it to crates.io.

This turned out to be a little trickier than it probably should have been, because I initially named the crate “http-fetch”.

When integration-testing the crate with another project, I soon found out that while your crate name on crates.io can contain a hyphen, you have to import it under a different name (with the hyphen replaced by an underscore) in Rust code. Not wanting to explain this (confusing) behavior in the README, I decided to rename the crate, which I had already published. Since renaming from “http-fetch” to “http_fetch” was not possible for some reason (cargo would not allow it), I created a new crate named “fetch”.

Retrospection

Though this project is small and in a functional state right now, there are already a number of points I need to dig into more deeply before writing more complex programs in Rust:

Error handling: To return errors from fetch using the Result type, which requires a common type for all possible errors, I currently transform all errors into a string containing a short description of where the error occurred. Sadly, this loses the original error, which the call site might want to inspect to help with debugging.
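One way to keep the original error while still returning a single Result type (something I have not done in the crate yet; the FetchError type below is hypothetical and sketched with std only) is a wrapper enum with From implementations, so the ? operator converts errors automatically without discarding them:

```rust
use std::fmt;
use std::io;
use std::string::FromUtf8Error;

// Hypothetical error type; the published crate currently returns Strings instead.
#[derive(Debug)]
enum FetchError {
    Io(io::Error),
    Utf8(FromUtf8Error),
}

impl fmt::Display for FetchError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match *self {
            FetchError::Io(ref e) => write!(f, "I/O error: {}", e),
            FetchError::Utf8(ref e) => write!(f, "encoding error: {}", e),
        }
    }
}

impl From<io::Error> for FetchError {
    fn from(e: io::Error) -> FetchError { FetchError::Io(e) }
}

impl From<FromUtf8Error> for FetchError {
    fn from(e: FromUtf8Error) -> FetchError { FetchError::Utf8(e) }
}

fn decode_body(bytes: Vec<u8>) -> Result<String, FetchError> {
    // `?` uses the From impl above, so the original error is preserved
    // inside the FetchError variant instead of being flattened to a String.
    Ok(String::from_utf8(bytes)?)
}

fn main() {
    assert!(decode_body(b"hello".to_vec()).is_ok());
    assert!(decode_body(vec![0xC3]).is_err()); // truncated UTF-8 sequence
}
```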

Borrowing and ownership: When reading into the buffer a second time, the compiler reported error E0506. While I don't yet fully understand the reason for this, the suggested solution of scoping the offending code in its own { } block works.
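The scoping workaround looks roughly like this (a minimal std-only reproduction, not the actual crate code): a reader that borrows the buffer is confined to its own block, so the borrow ends at the closing brace and the buffer can be used freely afterwards.

```rust
use std::io::Read;

// Read from a borrowed slice into a second buffer; the borrow is
// confined to its own block so the original Vec stays usable afterwards.
fn copy_then_move(data: Vec<u8>) -> (Vec<u8>, Vec<u8>) {
    let mut copy = Vec::new();
    {
        // `reader` borrows `data`; the borrow ends at the closing brace.
        let mut reader = data.as_slice();
        reader.read_to_end(&mut copy).unwrap();
    }
    // Using `data` by value here is fine because the borrow above has
    // ended; without the block, older compilers reject this with a
    // borrow-checker error like the E0506 mentioned in the post.
    (data, copy)
}

fn main() {
    let (original, copy) = copy_then_move(vec![1, 2, 3]);
    assert_eq!(original, copy);
}
```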

A lack of understanding of lifetimes has also so far prevented me from breaking the code up into smaller functions, which would (I assume) require some lifetime annotations.
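In many cases the annotations turn out to be small; a helper that returns a reference borrowed from its input only needs a single lifetime parameter. As an illustration, here is a hypothetical helper (not from the crate) that pulls the charset out of a Content-Type header value without allocating:

```rust
/// Extracts the value of a "charset=..." parameter from a Content-Type
/// header value, borrowing from the input rather than allocating.
/// The lifetime ties the returned slice to the input string.
fn charset_from_content_type<'a>(header: &'a str) -> Option<&'a str> {
    header
        .split(';')
        .map(|part| part.trim())
        .find(|part| part.starts_with("charset="))
        .map(|part| &part["charset=".len()..])
}

fn main() {
    let header = "text/html; charset=ISO-8859-1";
    assert_eq!(charset_from_content_type(header), Some("ISO-8859-1"));
    assert_eq!(charset_from_content_type("text/plain"), None);
}
```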

All in all, I was highly rewarded for this experiment by being exposed to new concepts and different approaches to API design and programming. Furthermore, I was positively surprised by how quickly I could find my way around the crates despite the lack of full code completion. Thanks to the great documentation available for all the crates used, this was not as big a slowdown as I had initially expected.

Now I am looking forward to writing more code in Rust, getting a better understanding of the issues mentioned above, and writing future-proof solutions for them.

[^1]: Which did not happen immediately, since I had an error with DEFLATE-compressed responses. It turns out you have to run them through the ZlibDecoder rather than the DeflateDecoder I had assumed at first (given the name): servers sending Content-Encoding: deflate usually wrap the raw DEFLATE stream in a zlib container. So, lesson learned: always test your code 🤗