Chapter 5 ojodb

5.1 Getting Started

5.1.1 Installation

5.1.2 Authorization

5.2 Basic Functionality

The basic function that you’ll use in {ojodb} is ojo_tbl. This retrieves a lazy tibble of data corresponding to a table in our database. The default behavior is to return a table from the public schema, i.e. OSCN and ODCR sourced data. The schema argument can alternatively be supplied to access data sourced from other location.

ojo_tbl("case")
ojo_tbl("arrest", schema = "iic")

To list all schemas available use ojo_list_schemas.

ojo_list_schemas()

## # A tibble: 9 × 1
##   schema            
##   <chr>             
## 1 ojo_oscn          
## 2 ojo_iic           
## 3 ojo_odoc          
## 4 public            
## 5 eviction_addresses
## 6 doc_tracker       
## 7 ocdc              
## 8 odoc              
## 9 iic

Once you know the schema you are interested in you can use ojo_list_tables to list all tables in the schema.

ojo_list_tables(schema = "iic")

## # A tibble: 3 × 2
##   schema table  
##   <chr>  <chr>  
## 1 iic    inmate 
## 2 iic    arrest 
## 3 iic    offense

Similarly, once you know the table you are interested in, you can list all its variables using ojo_list_vars.

ojo_list_vars("offense", "iic")

## character(0)

Rarely do we want to gather all data from a table. This is not only often irrelevant to the research question at hand, but also takes significant time and RAM. Rather, we opt to filter the data as much as possible before using the collect function to gather the data on our local machines. For instance, we can gather all criminal felonies in the Tulsa County court system for 2020, making sure to perform all filtering before collecting the data.

ojo_table("case") |>
  filter(year = 2020,
         case_type = "CF",
         district = "TULSA") |>
  collect()

5.2.1 Helper Functions

This section provides an introduction to those helper functions fully developed and tested. More are likely on their way, so check back here for an overview of how to use new functions. If you’d prefer technical documentation, including of function parameters and options, see the {ojodb} package documentation in RStudio’s ‘Help’ panel or with ??ojodb.

5.2.1.1 Tables

The majority of research questions we explore utilize the information in the public schema, and we perform these operations quite often. Some helper function have been created to streamline the process of filtering.

The first is ojo_crim_cases, which queries for cases with criminal case types matching the other arguments in the function.

ojo_crim_cases(
  districts = c("TULSA", "OKLAHOMA"),
  file_years = 2018:2020
) |>
  collect()

The second is ojo_civ_cases. This function performs much the same way, but defaults to querying the most commonly used civil case types.

ojo_civ_cases(
  districts = "ALFALFA",
  file_years = 2020
) |>
  collect()

Default behavior for both of the above functions is to return only the minimum number of relevant columns or variables needed. You can supply the argument vars as either a vector of variables names, e.g. vars = c("date_filed"), or using the keyword vars = "all". When supplying a vector of variable names, note that a minimum number of pre-selected variables will be included in the returned lazy tibble such that each row is unique. Using the keyword "all", unsurprisingly returns all columns available in a given table.

5.2.1.2 Joins

Often, the research question at hand will require multiple data tables to answer. Suppose our research interest is the fines and fees associated with criminal misdemeanor cases in Tulsa County over time, averaged over the course of a year. One way to gather the necessary data would be to first filter the case table for cases where case_type == "CM", then perform a left join to the minute table on "id" = "case_id". Finally, we could use the amount column to calculate monetary charges, filtering the minutes by the description field as necessary.

{ojodb} includes a number of ojo_add_ prefixed functions which will perform certain joins for you. For instance, using the example research question above, we can pull the same information using the following method:

ojo_tbl("case") |>
  filter(district == "TULSA",
         case_type == "CM") |>
  ojo_add_minutes()

Other ojo_add_ prefixed functions include: - ojo_add_issues() - ojo_add_counts() - … more to come!

Both the issues and counts joining functions are baked into the ojo_civ_cases() and ojo_crim_cases() functions for ease of use. When possible, using a pre-made _cases() function should be preferred to an ojo_tbl("case") call.

The returned object has a data source of PQConnection, meaning it is a lazy tibble. It must be collected before being used with certain functions. Though, do try and perform as many transformations as possible before collecting. Several {dplyr} transformations like mutate are implemented for database-connected lazy tibbles, for example.

Keep in mind that the source of the two tables R must be the same to perform a join. This means that you must collect() both tables prior to the join, or collect() neither of them until after the join. Usually performing collect() after the join is the way to go, since it adheres to the principle of collecting the least amount of data needed. However, a limited number of more complex data transformations require the data source to be a local tibble loaded into memory. In some of these cases the data and the memory usage of the computation will exceed that of your local machine. Depending on capacity and priority it is possible to perform the transformation on Google Cloud Platform.

5.2.1.3 {ggplot} Theme

{ojodb} has built in themeing for {ggplot}. Simply append the function ojo_theme() to the end of a plot.

data |>
ggplot(aes(x = date, y = val)) +
    geom_line() +
    ojo_theme()

Note that ojo_theme() handles spacing, font, and linetypes. It does not add a fill or color specification by default. Use ojo_fill() or ojo_color() respectively.

data |>
ggplot(aes(x = date_filed, y = n, color = district)) +
    geom_line() +
    ojo_theme() +
    ojo_color()

You can also set these as global options so you don’t have to apply them to every plot in your script.

theme_set(ojo_theme())