BrowseyHttp (BrowseyHttp v0.0.4)
BrowseyHttp is a browser-imitating HTTP client for scraping websites that resist bot traffic.
Browsey aims to behave as much like a real browser as possible, short of executing JavaScript. It's able to scrape sites that are notoriously difficult, including:
- Amazon
- TicketMaster
- LinkedIn (at least for the first few requests per day per IP, after which even real browsers will be shown the "auth wall")
- Real estate sites including Zillow, Realtor.com, and Trulia
- OpenSea
- Sites protected by Cloudflare
- Sites protected by PerimeterX/HUMAN Security
- Sites protected by DataDome, including Reddit, AllTrails, and RealClearPolitics
Plus, as a customer of Browsey, if you encounter a site Browsey can't scrape, we'll make a best-effort attempt to get a fix for you. (Fully client-side rendered sites, though, will still not be supported.)
Note that when scraping, you'll need to be mindful of both the IPs you're scraping from and how many requests you're sending to a given site. Too much traffic from a given IP will trip rate limits even if you were using a real browser. (For instance, if you try to scrape any major site within your CI system, it's almost guaranteed to fail. A shared IP on a cloud server is iffy as well.)
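One simple way to keep per-site traffic down is to space requests out. The following is a minimal sketch, not part of BrowseyHttp: the `PoliteFetcher` module, its `fetch_all/3` function, and the 2-second default delay are all hypothetical, and the right delay depends on the site you are scraping.

```elixir
defmodule PoliteFetcher do
  # Hypothetical helper, not part of BrowseyHttp.
  # Fetches each URL in order, sleeping `delay_ms` between requests so a
  # single IP does not send a burst of traffic to the target site.
  # `fetch` defaults to BrowseyHttp.get/2 but can be swapped out for testing.
  def fetch_all(urls, delay_ms \\ 2_000, fetch \\ &BrowseyHttp.get/2) do
    urls
    |> Enum.with_index()
    |> Enum.map(fn {url, index} ->
      # Pause before every request except the first.
      if index > 0, do: Process.sleep(delay_ms)
      {url, fetch.(url, [])}
    end)
  end
end
```

Each element of the returned list pairs a URL with the `{:ok, response}` or `{:error, exception}` tuple produced for it.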
Why BrowseyHttp?
Browsey versus other HTTP clients
Because Browsey imitates a real browser beyond just faking a user agent, it is able to scrape vastly more sites than a default-configured HTTP client like HTTPoison, Finch, or Req, which get blocked by Cloudflare and other anti-bot measures.
Browsey versus Selenium, Chromedriver, Playwright, etc.
Running a real, headless web browser is the gold standard for fooling bot detection, and it's the only way to scrape sites that are fully client-side rendered. However, running a real browser is extremely resource-intensive; it's not uncommon to encounter a site that will cause Chromedriver to use 6 GB of RAM or more. Headless browsers are also quite a bit slower than Browsey, since you end up waiting for the page to render, execute JavaScript, etc.
Worst of all, headless browsers can be unreliable. If you run a hundred requests, you'll encounter at least a few that fail in ways that aren't related to the site you're scraping having issues. Chromedriver may simply fail to respond to your commands for reasons that are impossible to diagnose. It may time out waiting for JavaScript to finish executing, and of course browsers can crash.
In contrast, Browsey is extremely reliable (it's too simple to fail in complicated ways like browsers do!), and it requires virtually no resources beyond the memory needed to store the response data. It also has built-in protections to ensure memory usage doesn't spiral out of control (see the :max_response_size_bytes option to BrowseyHttp.get/2).
Finally, Browsey is quite a bit faster than a headless browser.
Browsey versus a third-party scraping service like Zyte, ScrapeHero, or Apify
Third-party scraping APIs are billed as a complete, no-compromise solution for web scraping, but they often have reliability problems. You're essentially paying someone else to run a headless browser for you, but they're subject to the same issues as the headless browsers themselves in terms of reliability. It doesn't feel great to pay the high prices of a scraping service only to get back a failure unrelated to the site you're scraping being down.
Because of its reliability, flat monthly price, and low resource consumption, Browsey makes a better first choice for your scraping needs. Then you can fall back to expensive third-party APIs when you encounter a site that really needs a headless browser.
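The "Browsey first, expensive API second" approach can be sketched as follows. This is an illustration under stated assumptions: `ScrapeWithFallback` and the `fallback` function are hypothetical stand-ins for whatever third-party client you use, and the sketch only falls back on an `:error` tuple. Detecting a page that loaded but needs JavaScript to render is left to the caller.

```elixir
defmodule ScrapeWithFallback do
  # Hypothetical helper, not part of BrowseyHttp.
  # Tries the cheap fetcher first and calls the (presumably expensive)
  # `fallback` function only when the first attempt returns an error.
  # `primary` defaults to BrowseyHttp.get/2.
  def fetch(url, fallback, primary \\ &BrowseyHttp.get/2) do
    case primary.(url, []) do
      {:ok, response} -> {:ok, response}
      {:error, _exception} -> fallback.(url)
    end
  end
end
```

Here `fallback` is any one-argument function that takes the URL and returns its own result tuple.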
Summary
Functions
- get/2: Performs an HTTP GET request for a single resource, limiting the size we process to protect the server.
- get_with_resources/2: Performs an HTTP GET request for a resource plus any embedded resources (CSS, JavaScript, images, etc.).
- stream_with_resources/2: Same as BrowseyHttp.get_with_resources/2, but when the primary result succeeds, returns a stream of responses.
Types
@type browser() :: :chrome | :chrome_android | :edge | :safari
@type get_result() :: {:ok, BrowseyHttp.Response.t()} | {:error, Exception.t()}
@type http_get_option() :: {:follow_redirects?, boolean()} | {:max_retries, non_neg_integer()} | {:max_response_size_bytes, non_neg_integer() | :infinity} | {:receive_timeout, timeout()} | {:browser, browser() | :random} | {:ignore_ssl_errors?, boolean()}
@type resource_responses() :: [BrowseyHttp.Response.t() | Exception.t()]
Functions
@spec default_browser(uri_or_url()) :: browser()
@spec get(uri_or_url(), [http_get_option()]) :: get_result()
Performs an HTTP GET request for a single resource, limiting the size we process to protect the server.
Note that to fully imitate a browser, you may want to instead use BrowseyHttp.get_with_resources/2 to retrieve both the page itself and its embedded resources (CSS, JavaScript, images, etc.) at once.
Options
- :max_response_size_bytes: The maximum size of the response body, in bytes, or :infinity. If the response body exceeds this size, we'll return a TooLargeException. This is important so that unintentionally downloading, say, a huge video file doesn't run your server out of memory. Defaults to 5,242,880 (5 MiB).
- :follow_redirects?: Whether to follow redirects. Defaults to true, in which case the complete chain of redirects will be tracked in the BrowseyHttp.Response struct's :uri_sequence field.
- :max_retries: How many times to retry when the HTTP status code indicates an error. Defaults to 0.
- :receive_timeout: The maximum time (in milliseconds) to wait to receive a response after connecting to the server. Defaults to 30,000 (30 seconds).
- :browser: One of :chrome, :chrome_android, :edge, :safari, or :random. Defaults to :chrome, except for domains known to block our Chrome version, in which case a better default will be chosen.
- :ignore_ssl_errors?: If true, we won't produce an SslException when the SSL handshake fails. This can be useful when the remote server has a root certificate that is unknown to the browser (including self-signed certificates). Use with caution, of course. Defaults to false.
Examples
iex> case BrowseyHttp.get("https://www.example.com") do
...> {:ok, %BrowseyHttp.Response{body: body}} -> String.slice(body, 0, 15)
...> {:error, exception} -> exception
...> end
"<!doctype html>"
@spec get!(uri_or_url(), [http_get_option()]) :: BrowseyHttp.Response.t() | no_return()
@spec get_with_resources(uri_or_url(), [http_get_option() | resource_option()]) :: {:ok, [BrowseyHttp.Response.t() | resource_responses()]} | {:error, Exception.t()}
Performs an HTTP GET request for a resource plus any embedded resources (CSS, JavaScript, images, etc.).
This matches how a real browser fetches a page by retrieving the resources in parallel.
On success, the first of the returned response structs will always be the initial HTML page.
If the initial HTML page fails to load, we'll return an error tuple. However, if any of the embedded resources fail to load entirely (that is, they don't merely return an HTTP error like a 404, but they would cause an :error return from BrowseyHttp.get/2, such as a no-such-domain error or a timeout), they'll simply be left out of the returned response list.
If the initial resource we retrieve is not HTML, on success we'll return an ok tuple with a single response struct.
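Because the first element is always the initial page, a head/tail pattern match separates it from the embedded resources. This is a minimal sketch that assumes the success value is a flat list, as the prose above describes; the `PageWithResources` module and its `split/2` function are hypothetical.

```elixir
defmodule PageWithResources do
  # Hypothetical helper, not part of BrowseyHttp.
  # Splits a successful result into the primary page and its embedded
  # resources. `fetch` defaults to BrowseyHttp.get_with_resources/2.
  def split(url, fetch \\ &BrowseyHttp.get_with_resources/2) do
    case fetch.(url, []) do
      # For a non-HTML primary resource, `resources` is the empty list.
      {:ok, [primary | resources]} -> {:ok, primary, resources}
      {:error, exception} -> {:error, exception}
    end
  end
end
```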
Options
- Control the individual requests using the same options as BrowseyHttp.get/2.
- :ignore_uris: An enumerable of URI structs that we will skip fetching when they are referenced as resources. You can use this to do things like avoid re-crawling images that are present in the header of every page. Defaults to the empty set.
- :fetch_images?: Whether to fetch images referenced in <img> and <link rel="icon"> tags. Defaults to true.
- :fetch_css?: Whether to fetch CSS files referenced in <link rel="stylesheet"> tags. Defaults to true.
- :fetch_js?: Whether to fetch JavaScript files referenced in <script> tags. Defaults to true.
- :load_resources_when_redirected_off_host?: If false, we'll skip crawling resources if the URL redirects to a different host. Defaults to false to prevent unintentionally loading resources from a site you didn't expect.
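As an illustration of the resource options above, the sketch below builds an option list that fetches only the page and its stylesheets while skipping already-seen URLs. The `HtmlAndCssOnly` module is hypothetical, and passing a `MapSet` for :ignore_uris is an assumption based on the "enumerable of URI structs" description.

```elixir
defmodule HtmlAndCssOnly do
  # Hypothetical helper, not part of BrowseyHttp.
  # Builds options that skip images and JavaScript, and ignore the given
  # URLs (strings), which are parsed into URI structs.
  def opts(skip_urls) do
    [
      fetch_images?: false,
      fetch_js?: false,
      fetch_css?: true,
      ignore_uris: MapSet.new(skip_urls, &URI.parse/1)
    ]
  end
end

# Intended usage (requires the BrowseyHttp dependency):
#   BrowseyHttp.get_with_resources(
#     "https://www.example.com",
#     HtmlAndCssOnly.opts(["https://www.example.com/site.css"])
#   )
```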
@spec stream_with_resources(uri_or_url(), [http_get_option() | resource_option()]) :: {:ok, Enumerable.t(BrowseyHttp.Response.t() | Exception.t())} | {:error, Exception.t()}
Same as BrowseyHttp.get_with_resources/2, but when the primary result succeeds, returns a stream of responses.
As with the non-streaming version, the first response will always be the initial resource.
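A stream lets you process each response as you consume it rather than holding the whole list at once. The following sketch collects body sizes from the stream; the `StreamedPage` module is hypothetical, and since the spec allows stream elements to be exceptions as well as responses, the sketch skips exceptions explicitly.

```elixir
defmodule StreamedPage do
  # Hypothetical helper, not part of BrowseyHttp.
  # Returns the byte size of every response body in the stream, in order.
  # `stream_fun` defaults to BrowseyHttp.stream_with_resources/2.
  def body_sizes(url, stream_fun \\ &BrowseyHttp.stream_with_resources/2) do
    case stream_fun.(url, []) do
      {:ok, stream} ->
        sizes =
          stream
          |> Stream.flat_map(fn
            # Skip any element that is an exception struct.
            item when is_exception(item) -> []
            %{body: body} -> [byte_size(body)]
          end)
          |> Enum.to_list()

        {:ok, sizes}

      {:error, exception} ->
        {:error, exception}
    end
  end
end
```

The first size in the returned list belongs to the initial resource, matching the ordering guarantee stated above.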