adaR is a wrapper for ada-url, a WHATWG-compliant and fast URL parser written in modern C++.
It also implements several auxiliary functions for working with URLs, including a faster alternative to utils::URLdecode (~40x speedup).
More general information on URL parsing can be found in the introductory vignette via vignette("adaR").
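As a rough illustration of the decoding helper mentioned above, the sketch below percent-decodes two URLs. It assumes the exported function is named url_decode2() and accepts a character vector; check the package reference if your installed version differs. The example URLs are made up for illustration.

```r
library(adaR)

# Hypothetical example inputs containing percent-encoded characters
encoded <- c(
  "https://example.org/a%20b",
  "https://example.org/search?q=R%26D"
)

# Assumed adaR helper: decodes the whole vector in one call
decoded <- url_decode2(encoded)
print(decoded)

# Base R equivalent for comparison; utils::URLdecode is applied
# element by element here
base_decoded <- vapply(encoded, utils::URLdecode, character(1), USE.NAMES = FALSE)
print(base_decoded)
```

Both calls should agree on plain ASCII input like this; the speed difference only becomes visible on long vectors.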
adaR is part of a series of R packages to analyse webtracking data.
You can install the development version of adaR from GitHub with:
# install.packages("devtools")
devtools::install_github("gesistsa/adaR")
The version on CRAN can be installed with:
install.packages("adaR")
This is a basic example which shows all the returned components of a URL.
library(adaR)
ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag")
#> href protocol username
#> 1 https://user_1:password_1@example.org:8080/api?q=1#frag https: user_1
#> password host hostname port pathname search hash
#> 1 password_1 example.org:8080 example.org 8080 /api ?q=1 #frag
/*
* https://user:pass@example.com:1234/foo/bar?baz#quux
* | | | | ^^^^| | |
* | | | | | | | `----- hash_start
* | | | | | | `--------- search_start
* | | | | | `----------------- pathname_start
* | | | | `--------------------- port
* | | | `----------------------- host_end
* | | `---------------------------------- host_start
* | `--------------------------------------- username_end
* `--------------------------------------------- protocol_end
*/
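Since the parse result is a data frame with one row per URL, several URLs can be handled in one call and individual components selected by column name. The following is a minimal sketch, assuming ada_url_parse() accepts a character vector; the second URL is an arbitrary example, and the column names are the ones shown in the output above.

```r
library(adaR)

# Two example URLs; the first reuses host and port from the example above
urls <- c(
  "https://example.org:8080/api?q=1#frag",
  "https://www.r-project.org/about.html"
)

# One row per input URL
parsed <- ada_url_parse(urls)

# Keep only the components of interest
parsed[, c("hostname", "port", "pathname")]
```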
It handles some complex URLs that urltools parses incorrectly.
urltools::url_parse("https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.
 7z/data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519")
#> scheme domain port
#> 1 https 40.7519848,-74.0015045,14.\n 7z <NA>
#> path
#> 1 data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519
#> parameter fragment
#> 1 <NA> <NA>
ada_url_parse("https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m
5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519")
#> href
#> 1 https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m 5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519
#> protocol username password host hostname port
#> 1 https: www.google.com www.google.com
#> pathname
#> 1 /maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m 5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519
#> search hash
#> 1
A “raw” URL parse using ada is extremely fast (see ada-url.com), but carrying this speed over to R is tricky. The performance is still comparable to urltools::url_parse, with the noted advantage in accuracy in some practical circumstances.
bench::mark(
  ada = ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag", decode = FALSE),
  urltools = urltools::url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag"),
  check = FALSE
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 ada 158µs 165µs 5913. 0B 45.3
#> 2 urltools 104µs 108µs 8488. 0B 42.6
For further benchmark results, see benchmark.md in data_raw.
There are four more groups of functions available to work with url parsing:
ada_get_*(): get a specific component
ada_has_*(): check if a specific component is present
ada_set_*(): set a specific component of URLs
ada_clear_*(): remove a specific component from URLs
public_suffix() extracts the top-level domain of URLs according to the public suffix list, excluding private domains.
urls <- c(
  "https://subsub.sub.domain.co.uk",
  "https://domain.api.gov.uk",
  "https://thisisnotpart.butthisispartoftheps.kawasaki.jp"
)
public_suffix(urls)
#> [1] "co.uk" "gov.uk"
#> [3] "butthisispartoftheps.kawasaki.jp"
If you are wondering about the last URL: the list also contains wildcard suffixes such as *.kawasaki.jp, which need to be matched.
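Returning to the four function groups listed earlier (ada_get_*(), ada_has_*(), ada_set_*(), ada_clear_*()), the sketch below uses one function from each group on the URL from the basic example. The specific function names (ada_get_hostname(), ada_has_port(), ada_set_port(), ada_clear_hash()) and the character port value are assumptions based on the naming pattern; consult the package reference for the full list and exact arguments.

```r
library(adaR)

url <- "https://user_1:password_1@example.org:8080/dir/../api?q=1#frag"

# ada_get_*(): read a single component
hostname <- ada_get_hostname(url)

# ada_has_*(): test whether a component is present
has_port <- ada_has_port(url)

# ada_set_*(): return the URL with one component replaced
# (the new port is passed as a string here, which is an assumption)
with_port <- ada_set_port(url, "9090")

# ada_clear_*(): return the URL with one component removed
no_hash <- ada_clear_hash(url)

print(hostname)
print(has_port)
print(with_port)
print(no_hash)
```

None of these calls modify url in place; each returns a new value, so results can be chained or assigned as needed.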
The logo is created from this portrait of Ada Lovelace, a very early pioneer in Computer Science.