Chapter 5 ojodb
5.1 Getting Started
5.1.1 Installation
5.2 Basic Functionality
The basic function that you’ll use in {ojodb}
is ojo_tbl
. This retrieves a lazy tibble of data corresponding to a table in our database. The default behavior is to return a table from the public
schema, i.e. OSCN and ODCR sourced data. The schema
argument can alternatively be supplied to access data sourced from other location.
ojo_tbl("case")
ojo_tbl("arrest", schema = "iic")
To list all schemas available use ojo_list_schemas
.
ojo_list_schemas()
## # A tibble: 9 × 1
## schema
## <chr>
## 1 ojo_oscn
## 2 ojo_iic
## 3 ojo_odoc
## 4 public
## 5 eviction_addresses
## 6 doc_tracker
## 7 ocdc
## 8 odoc
## 9 iic
Once you know the schema you are interested in you can use ojo_list_tables
to list all tables in the schema.
ojo_list_tables(schema = "iic")
## # A tibble: 3 × 2
## schema table
## <chr> <chr>
## 1 iic inmate
## 2 iic arrest
## 3 iic offense
Similarly, once you know the table you are interested in, you can list all its variables using ojo_list_vars
.
ojo_list_vars("offense", "iic")
## character(0)
Rarely do we want to gather all data from a table. This is not only often irrelevant to the research question at hand, but also takes significant time and RAM. Rather, we opt to filter the data as much as possible before using the collect
function to gather the data on our local machines. For instance, we can gather all criminal felonies in the Tulsa County court system for 2020, making sure to perform all filtering before collecting the data.
ojo_table("case") |>
filter(year = 2020,
case_type = "CF",
district = "TULSA") |>
collect()
5.2.1 Helper Functions
This section provides an introduction to those helper functions fully developed and tested. More are likely on their way, so check back here for an overview of how to use new functions. If you’d prefer technical documentation, including of function parameters and options, see the {ojodb}
package documentation in RStudio’s ‘Help’ panel or with ??ojodb
.
5.2.1.1 Tables
The majority of research questions we explore utilize the information in the public
schema, and we perform these operations quite often. Some helper function have been created to streamline the process of filtering.
The first is ojo_crim_cases
, which queries for cases with criminal case types matching the other arguments in the function.
ojo_crim_cases(
districts = c("TULSA", "OKLAHOMA"),
file_years = 2018:2020
|>
) collect()
The second is ojo_civ_cases
. This function performs much the same way, but defaults to querying the most commonly used civil case types.
ojo_civ_cases(
districts = "ALFALFA",
file_years = 2020
|>
) collect()
Default behavior for both of the above functions is to return only the minimum number of relevant columns or variables needed. You can supply the argument vars
as either a vector of variables names, e.g. vars = c("date_filed")
, or using the keyword vars = "all"
. When supplying a vector of variable names, note that a minimum number of pre-selected variables will be included in the returned lazy tibble such that each row is unique. Using the keyword "all"
, unsurprisingly returns all columns available in a given table.
5.2.1.2 Joins
Often, the research question at hand will require multiple data tables to answer. Suppose our research interest is the fines and fees associated with criminal misdemeanor cases in Tulsa County over time, averaged over the course of a year. One way to gather the necessary data would be to first filter the case
table for cases where case_type == "CM"
, then perform a left join to the minute
table on "id" = "case_id"
. Finally, we could use the amount
column to calculate monetary charges, filtering the minutes by the description
field as necessary.
{ojodb}
includes a number of ojo_add_
prefixed functions which will perform certain joins for you. For instance, using the example research question above, we can pull the same information using the following method:
ojo_tbl("case") |>
filter(district == "TULSA",
== "CM") |>
case_type ojo_add_minutes()
Other ojo_add_
prefixed functions include:
- ojo_add_issues()
- ojo_add_counts()
- … more to come!
Both the issues
and counts
joining functions are baked into the ojo_civ_cases()
and ojo_crim_cases()
functions for ease of use. When possible, using a pre-made _cases()
function should be preferred to an ojo_tbl("case")
call.
The returned object has a data source of PQConnection, meaning it is a lazy tibble. It must be collected before being used with certain functions. Though, do try and perform as many transformations as possible before collecting. Several {dplyr}
transformations like mutate
are implemented for database-connected lazy tibbles, for example.
Keep in mind that the source of the two tables R must be the same to perform a join. This means that you must collect()
both tables prior to the join, or collect() neither of them until after the join. Usually performing collect()
after the join is the way to go, since it adheres to the principle of collecting the least amount of data needed. However, a limited number of more complex data transformations require the data source to be a local tibble loaded into memory. In some of these cases the data and the memory usage of the computation will exceed that of your local machine. Depending on capacity and priority it is possible to perform the transformation on Google Cloud Platform.
5.2.1.3 {ggplot} Theme
{ojodb} has built in themeing for {ggplot}. Simply append the function ojo_theme()
to the end of a plot.
|>
data ggplot(aes(x = date, y = val)) +
geom_line() +
ojo_theme()
Note that ojo_theme()
handles spacing, font, and linetypes. It does not add a fill or color specification by default. Use ojo_fill()
or ojo_color()
respectively.
|>
data ggplot(aes(x = date_filed, y = n, color = district)) +
geom_line() +
ojo_theme() +
ojo_color()
You can also set these as global options so you don’t have to apply them to every plot in your script.
theme_set(ojo_theme())