2020-01-10

Performance improvements in ABM

A typical issue (db timeout is set to 60s!) is that patience is required:

> source(here("data-raw/public_data.R"))
Patience, please. It takes a while to fetch the data into the cache.
 Error: nanodbc/nanodbc.cpp:950: HYT00: [Microsoft]
 [ODBC Driver 17 for SQL Server]Login timeout expired
 
> source(here("data-raw/public_data.R"))
Patience, please. It takes a while to fetch the data into the cache.
Updating cached data for public data at: /home/markus/.config/bibmon/public.rds
✔ Setting active project to '/home/markus/repos/bibliomatrix'
✔ Saving 'abm_public_kth' to 'data/abm_public_kth.rda'

Some potential areas of improvement

  • Improve database calls (at backend level - views/SPs/temptables etc)
  • Improve functions (at R package level)
  • Improve caching (at app level)
  • Improve server (more hardware / memory)
  • Evaluate alternative backends (investigate using Apache Spark?)
  • Faster rendering of rmarkdown (see how on this link)

What areas do you see?

Which areas are the most important?

Refactoring db calls and functions

  • How long does calls to abm_table*() take?
  • Everything else equal - differences between backends (mssql vs sqlite3)?
  • When to issue a collect? Should compute() be used? See this link.
  • Any opportunities to index?
  • Other ideas?

Caching

  • Since interactive time (sub-second) is not there -> caching
  • Caching of pre-rendered flexdashboards - a function exists
  • Two kinds of flexdashboards (almost dupes), variants:
    • logged in (SAML auth)
    • not logged in (“anonymous”)

Measuring fcn performance

library(microbenchmark)
library(bibliomatrix)
library(dplyr)

mb <- microbenchmark(times = 1, unit = "s",
  t1 <- abm_table1(con = pool_bib("mssql")),
  t2 <- abm_table1(con = pool_bib("sqlite"))
)

print(mb)
## Unit: seconds
##                                        expr       min        lq      mean
##   t1 <- abm_table1(con = pool_bib("mssql")) 5.4390645 5.4390645 5.4390645
##  t2 <- abm_table1(con = pool_bib("sqlite")) 0.7101889 0.7101889 0.7101889
##     median        uq       max neval
##  5.4390645 5.4390645 5.4390645     1
##  0.7101889 0.7101889 0.7101889     1

Measuring fcn performance

We can measure the times spent inside the abm_table1() when connecting against mssql using profvis:

library(profvis)

perf_profile <- profvis(abm_table1(con = pool_bib("mssql")))

Results are on the next slide

Measuring query perf

We can look at the query sent to the backend and profile it using EXPLAIN etc:

abm_data(unit_code = "KTH") %>% show_query()
## <SQL>
## SELECT *
## FROM `masterfile`
## WHERE (`Unit_code` IN ('KTH'))
#abm_data(unit_code = "KTH") %>% explain()  # for query plan used in db

# sqlite3 db does not have indexes

Adding indexes and tuning a query

# go to the sqlite3 db and start the db CLI
cd ~/.config/bibmon && sqlite3 bibmon.db

# use the dbplyr sql (which uses %in%)
sqlite> .timer on
sqlite> SELECT count(*)
   ...> FROM `masterfile`
   ...> WHERE (`Unit_code` IN ('KTH'));
Run Time: real 0.018 user 0.008986 sys 0.008708

# add an index
sqlite> create index idx1 on Masterfile(Unit_code);
Run Time: real 0.092 user 0.064022 sys 0.016046

sqlite> select count(*) from masterfile where Unit_code = "KTH";
Run Time: real 0.006 user 0.005362 sys 0.000000

SQLite3 db after index

EXPLAIN the query plan (shows that index is being used)

sqlite> .eqp on
sqlite> select count(*) from masterfile where Unit_code in ("KTH");
--EQP-- 0,0,0,SEARCH TABLE masterfile USING COVERING INDEX idx1 (Unit_code=?)
37274
Run Time: real 0.005 user 0.005458 sys 0.000000

Measuring several fcns

library(rlang)
library(purrr)
library(dplyr)

fns <- function(table_no, db_type) 
  sprintf("abm_table%s(con = pool_bib('%s'))", table_no, db_type)

fns_sqlite <- fns(1:5, "sqlite")
fns_mssql <- fns(1:5, "mssql")

mb <- function(f) 
  microbenchmark(times = 1, unit = "s", x <- force(f))$time / 1e9

perf <- function(f) tibble(
    f = f,
    p = map_dbl(f, function(x) mb(eval_tidy(parse_quo(x, env = global_env()))))
  )

res <- bind_cols(
  perf(fns_sqlite),
  perf(fns_mssql)
)

Results on next page.

f p f1 p1
abm_table1(con = pool_bib(‘sqlite’)) 0.8028460 abm_table1(con = pool_bib(‘mssql’)) 5.274094
abm_table2(con = pool_bib(‘sqlite’)) 0.6155447 abm_table2(con = pool_bib(‘mssql’)) 4.547223
abm_table3(con = pool_bib(‘sqlite’)) 0.6655889 abm_table3(con = pool_bib(‘mssql’)) 8.077782
abm_table4(con = pool_bib(‘sqlite’)) 0.7652151 abm_table4(con = pool_bib(‘mssql’)) 5.426737
abm_table5(con = pool_bib(‘sqlite’)) 0.6206377 abm_table5(con = pool_bib(‘mssql’)) 5.392559

Piñatas / TODOs

  • Make abm_table*() fcns faster