Introduction

This file documents the features of a package of functions developed to make it easier to explore the ONS Quarterly Labour Force Survey (LFS) using the R programming language, and presents some examples of data visualisation using the data obtained via these functions.

The Main Functions

queryLFS()

This is the most important function in the package. It allows one to specify the criteria of a query and run it across every time period for which data is available, or across as many as wanted.

If there is no record of the query having been requested previously, R will summarise the request, including the quarters/years for which the variables are available (if they are in the varinfo database), and print progress updates as each raw dataset is summarised.

Data for all available time points

The following is an example call to the queryLFS() function which will be explained in detail below:

queryLFS(MainVar="FLEXW7", outcome="sum",grouping="SEXL;nAGE",filtering="FLEXW7 == 1")

This is an example query that one would run to collate data for all available quarters, giving the sum of the number of people for whom the FLEXW7 variable in the LFS is equal to 1, split by two grouping variables: SEXL and nAGE. The LFS documentation will tell you that the FLEXW7 variable denotes whether the respondent’s main job is based on the terms of a zero-hours contract, with 1 being the code representing ‘yes’. The grouping variables are more straightforward to interpret:

SEXL - Sex of respondent
nAGE - Age in 5-year bands (note: the youngest band is 16-19)

There are other optional arguments to the queryLFS() function that allow you to specify which year/quarter combinations to extract data for (example to follow). When these arguments are not specified by the user, the function will attempt to work out which years/quarters the data is available for and then collect them all. This works when the variables involved in the request (in this case FLEXW7, SEXL and nAGE) are all included in the varinfo database, which can be viewed by executing the create_variableInfo_df() function. The following table is a subset of the varinfo database:

mainvar    yearStarted  quarterStarted  yearEnded  quarterEnded  Q1  Q2  Q3  Q4
FLEXW7     2002         1               NA         NA            0   1   0   1
FLEXW7L    2002         1               NA         NA            0   1   0   1
INECAC05   2005         2               NA         NA            1   1   1   1
INECAC05L  2005         2               NA         NA            1   1   1   1
INECACR    1992         2               2005       1             1   1   1   1
INECACRL   1992         2               2005       1             1   1   1   1

If all the variables the user is interested in are documented in the varinfo database then a request for all available data is made easy.
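
As a rough illustration of how a table like the varinfo database can be turned into a list of year/quarter pairs, here is a minimal sketch. The helper available_periods() and its arguments latest_year and latest_quarter are hypothetical and are not part of the package; the sketch also assumes that an NA end date means the variable is still being collected and that the Q1 to Q4 columns flag the quarters in which the variable appears.

```r
# Hypothetical copy of three rows of the varinfo table shown above.
varinfo <- data.frame(
  mainvar        = c("FLEXW7", "INECAC05", "INECACR"),
  yearStarted    = c(2002, 2005, 1992),
  quarterStarted = c(1, 2, 2),
  yearEnded      = c(NA, NA, 2005),
  quarterEnded   = c(NA, NA, 1),
  Q1 = c(0, 1, 1), Q2 = c(1, 1, 1), Q3 = c(0, 1, 1), Q4 = c(1, 1, 1)
)

# Hypothetical helper (not the package's code): list the year/quarter pairs
# in which `var` is present. An NA end date is ASSUMED to mean "still
# available", so the caller supplies the latest release to stop at.
available_periods <- function(varinfo, var, latest_year, latest_quarter) {
  v <- varinfo[varinfo$mainvar == var, ]
  stopifnot(nrow(v) == 1)
  end_year    <- if (is.na(v$yearEnded)) latest_year else v$yearEnded
  end_quarter <- if (is.na(v$quarterEnded)) latest_quarter else v$quarterEnded

  # Every quarter of every year in range, then trim to the start/end quarter
  # and to the quarters flagged with 1 in the Q1..Q4 columns.
  grid  <- expand.grid(quarter = 1:4, year = v$yearStarted:end_year)
  idx   <- grid$year * 4 + grid$quarter
  flags <- unlist(v[, c("Q1", "Q2", "Q3", "Q4")])
  keep  <- idx >= v$yearStarted * 4 + v$quarterStarted &
           idx <= end_year * 4 + end_quarter &
           flags[grid$quarter] == 1

  out <- grid[keep, c("year", "quarter")]
  rownames(out) <- NULL
  out
}

flexw7_periods <- available_periods(varinfo, "FLEXW7", 2003, 4)
print(flexw7_periods)
```

With these inputs only the Q2 and Q4 quarters of 2002 and 2003 are kept for FLEXW7, which mirrors the twice-yearly availability recorded in the table.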

However, the function has been written to be more general. Using the varinfo database to know which quarters/years include which data is a helper feature, not a requirement. When using variables that are not included in the varinfo database, the user needs to be more aware of the documentation of the LFS variables they are interested in and must ensure they only ask for time periods in which the variables are actually present in the data. This can be more difficult than it ought to be.

Data for specific time periods

A common situation would be to interrogate just the latest LFS dataset to gain an understanding of the labour force as it is in the most recently available figures.

queryLFS(MainVar="FLEXW7",
                        outcome="sum",
                        grouping="INDE07ML",
                        filtering="FLEXW7 == 1",
                        yearstart=2017,
                        yearend=2017,
                        quarters=c(4))

The INDE07ML variable codes the industry sector that the respondent’s main job is a part of, so this query gets the number of people with zero-hour jobs grouped by employment sector. In this query we have specified some additional arguments (yearstart, yearend and quarters), which instruct the function to obtain only the data for the Oct-Dec quarter of 2017, which at the time of writing was the most recent release of the LFS.
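
To make the meaning of the three arguments concrete, the following sketch enumerates the year/quarter combinations that a given yearstart, yearend and quarters describe. It is only an illustration in base R; how queryLFS() builds this list internally is not shown in this document.

```r
# Illustration only: enumerate the periods implied by the three arguments.
yearstart <- 2015
yearend   <- 2017
quarters  <- c(2, 4)

# One row per requested quarter of each requested year.
periods <- expand.grid(quarter = quarters, year = yearstart:yearend)
periods <- periods[, c("year", "quarter")]
print(periods)
```

Setting yearstart and yearend to the same year and quarters to c(4), as in the call above, reduces this to a single period.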

To summarise, here is a list of the possible arguments to queryLFS() and a description:

  • MainVar = Name of the variable you consider the primary part of the query, the dependent variable.
  • outcome = The outcome/statistic you are interested in. Often this is a count of the people who belong to categories, so “sum” should be used to call sum() on the requested data. An alternative may be an average, e.g. the arithmetic mean of the most recent weekly wage, in which case “mean” would be used to call mean() on the requested data.
  • yearstart = Earliest year to start requesting data for the query.
  • yearend = Final year to request data for the query.
  • quarters = Which quarters to include in the query, given as a numeric vector, e.g. c(2,4), c(4) or c(1,2,3,4)
    • 1 = Jan-Mar
    • 2 = Apr-Jun
    • 3 = Jul-Sep
    • 4 = Oct-Dec
  • filtering = List of filtering conditions separated by “;”.
  • grouping = List of factors to group by separated by “;”.
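
The sketch below shows one way that “;”-separated filtering and grouping strings could be applied to a single dataset in base R. It is not the package's implementation: run_toy_query(), split_spec(), the value argument and the weight column are all hypothetical, and the toy data is invented.

```r
# Toy stand-in for one quarterly dataset; all values are invented.
toy <- data.frame(
  SEXL   = c("Male", "Female", "Male", "Female", "Male"),
  nAGE   = c("16-19", "16-19", "20-24", "20-24", "20-24"),
  FLEXW7 = c(1, 1, 2, 1, 1),
  weight = c(100, 150, 120, 80, 60)  # hypothetical column to summarise
)

# Split a ";"-separated specification into its trimmed parts.
split_spec <- function(spec) trimws(strsplit(spec, ";", fixed = TRUE)[[1]])

# Hypothetical helper: apply each filter in turn, then summarise `value`
# within every combination of the grouping variables.
run_toy_query <- function(dat, value, outcome, grouping, filtering) {
  for (cond in split_spec(filtering)) {
    dat <- dat[eval(parse(text = cond), envir = dat), , drop = FALSE]
  }
  aggregate(dat[value], by = dat[split_spec(grouping)],
            FUN = match.fun(outcome))
}

res <- run_toy_query(toy, value = "weight", outcome = "sum",
                     grouping = "SEXL;nAGE", filtering = "FLEXW7 == 1")
print(res)
```

Passing the outcome as a function name and resolving it with match.fun() is what lets the same code serve both “sum” and “mean”.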

Cached Results & Execution time

As the number of quarterly LFS datasets that a call to queryLFS() needs to analyse increases, so does the time taken to execute the query, quite significantly. To make this more convenient, the function keeps a cache of the queries it has computed and saves the resulting dataframes as R objects to be reused later. At the start of any query the function first compares the criteria of the request with the cached requests. If the current request matches a previous request, the cached result is returned, removing the need to process the same data in the same way repeatedly. Additionally, if your request uses a set of grouping variables that is a subset of the grouping of a previous request, or covers a subset of the time periods of a previous request, the function regroups the more detailed cached result to fit the current request and returns that.
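
The exact-match part of this idea can be sketched with an in-memory cache keyed by the query criteria. This is a simplified assumption rather than the package's mechanism: query_cache, cache_key() and cached_query() are hypothetical names, and a persistent version would write the results to disk (for example with saveRDS() and readRDS()) instead of holding them in an environment.

```r
# Hypothetical in-memory cache: an environment mapping query keys to results.
query_cache <- new.env()

# Build a key from the criteria that define a query.
cache_key <- function(MainVar, outcome, grouping = "", filtering = "") {
  paste(MainVar, outcome, grouping, filtering, sep = "|")
}

# Return the cached result for `key` if there is one; otherwise run the
# (slow) computation, store the result and return it.
cached_query <- function(key, compute) {
  if (exists(key, envir = query_cache, inherits = FALSE)) {
    return(get(key, envir = query_cache))
  }
  result <- compute()
  assign(key, result, envir = query_cache)
  result
}

# Stand-in for the expensive step of reading and summarising raw datasets.
calls <- 0
slow_summary <- function() {
  calls <<- calls + 1
  data.frame(Outcome = 42)  # invented value
}

key    <- cache_key("FLEXW7", "sum", "SEXL", "FLEXW7 == 1")
first  <- cached_query(key, slow_summary)
second <- cached_query(key, slow_summary)  # served from the cache
```

After both calls the counter shows that the expensive step ran only once.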

Take the following requests for example:

zhdata_SA <- queryLFS(MainVar="FLEXW7", outcome="sum",grouping="SEXL;nAGE",filtering="FLEXW7 == 1")
zhdata_S <- queryLFS(MainVar="FLEXW7", outcome="sum",grouping="SEXL",filtering="FLEXW7 == 1")

For the first request, which creates zhdata_SA grouped by SEXL and nAGE for all available time points, the function needs to read in all the available data and carry out the analysis. This will take some time, and updates on the progress of the analysis are printed to the console. Fortunately, if we then decide we want the numbers only for each sex at each time point, we can run the second command to get zhdata_S. This time the function does not need to read in all the raw data again, as the result is produced by regrouping the cached result from zhdata_SA.
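
Regrouping a cached result is straightforward when the outcome is a sum, because the sums of the finer groups can simply be added together. The sketch below illustrates this with invented numbers; regroup_sum() is a hypothetical helper and not the package's code. Note that the same shortcut would not work directly for a mean, which would need the group sizes as well.

```r
# Hypothetical cached result grouped by SEXL and nAGE; numbers are invented.
cached_SA <- expand.grid(
  SEXL    = c("Male", "Female"),
  nAGE    = c("16-19", "20-24"),
  year    = 2002,
  quarter = c(2, 4),
  stringsAsFactors = FALSE
)
cached_SA$Outcome <- c(10, 20, 30, 40, 11, 21, 31, 41)

# Collapse the cached result to a coarser grouping by summing within each
# remaining group and time point.
regroup_sum <- function(cached, keep_groups) {
  by_cols <- c(keep_groups, "year", "quarter")
  aggregate(cached["Outcome"], by = cached[by_cols], FUN = sum)
}

derived_S <- regroup_sum(cached_SA, "SEXL")
print(derived_S)
```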

R response if request is cached due to previous request

calc_pct_change()

This function takes a dataframe in the form returned by queryLFS(), calculates the % change from the previous quarter for the specified variable and adds this to the input dataframe. It would work with any data as long as there are ‘year’ and ‘quarter’ variables, which are present in any result from queryLFS().

The arguments:

  • dat = The data frame object
  • dv = The name of the variable to calculate the % change from previous timepoint.
  • groups = A list of the factors to group by separated by “;”
  • dp = Number of decimal places to round the results to, defaults to 3.

Example:

zhdata_S <- queryLFS(MainVar="FLEXW7", outcome="sum",grouping="SEXL",filtering="FLEXW7 == 1")
kable(head(data.frame(zhdata_S)))

SEXL    quarter  year  Q_Year   Outcome
Male    2        2002  2002 Q2  55667
Female  2        2002  2002 Q2  45570
Male    4        2002  2002 Q4  53638
Female  4        2002  2002 Q4  47838
Male    2        2003  2003 Q2  58019
Female  2        2003  2003 Q2  44256

zhdata_S <- calc_pct_change(dat = zhdata_S, dv = "Outcome", groups="SEXL")
kable(head(data.frame(zhdata_S)))

SEXL    quarter  year  Q_Year   Outcome  pct.chg
Male    2        2002  2002 Q2  55667    NA
Female  2        2002  2002 Q2  45570    NA
Male    4        2002  2002 Q4  53638    -3.645
Female  4        2002  2002 Q4  47838    4.977
Male    2        2003  2003 Q2  58019    8.168
Female  2        2003  2003 Q2  44256    -7.488

The initial data (zhdata_S) is the result of a call to queryLFS() calculating the sum of people whose main job is on the basis of a zero-hours contract, for each sex and for each quarter in which the data is available in the LFS. The call to calc_pct_change() adds the pct.chg variable, which gives that group's % change from the previous time point (e.g. row 3 shows the % change in the number of males with zero-hours contracts at Q4 2002 versus the previous quarter for which data is available, Q2 2002).
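
The figures in the table can be reproduced with a short base R sketch of the calculation, 100 * (current - previous) / previous within each group. The function pct_change_sketch() is a hypothetical stand-in written for this illustration, not the package's calc_pct_change().

```r
# The six rows shown in the table above.
dat <- data.frame(
  SEXL    = rep(c("Male", "Female"), times = 3),
  quarter = c(2, 2, 4, 4, 2, 2),
  year    = c(2002, 2002, 2002, 2002, 2003, 2003),
  Outcome = c(55667, 45570, 53638, 47838, 58019, 44256)
)

# Hypothetical stand-in: % change from the previous available time point
# within each group, rounded to `dp` decimal places.
pct_change_sketch <- function(dat, dv, groups, dp = 3) {
  dat <- dat[order(dat$year, dat$quarter), ]
  group_cols <- strsplit(groups, ";", fixed = TRUE)[[1]]
  grp <- interaction(dat[group_cols], drop = TRUE)
  dat$pct.chg <- ave(dat[[dv]], grp, FUN = function(x) {
    previous <- c(NA, head(x, -1))  # value at the previous time point
    round(100 * (x - previous) / previous, dp)
  })
  dat
}

out <- pct_change_sketch(dat, dv = "Outcome", groups = "SEXL")
print(out)
```

For row 3 this gives 100 * (53638 - 55667) / 55667, which rounds to -3.645 as in the table.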

get_yoy_pct_change()

Similar to the above function but specifically gets the % change from the previous quarter and the previous year if the data is available.

Example:

zhdata_S_yoy <- get_yoy_pct_change(dat = zhdata_S, dv = "Outcome", groups="SEXL")
kable(data.frame(zhdata_S_yoy))

SEXL    yoy     qoq
Male    -4.289  NA
Female  6.746   NA

The result of this function is a new dataframe with yoy and qoq variables, for ‘year-on-year’ and ‘quarter-on-quarter’, which give the % change for each group (if groups were requested) from the previous year’s data and from the previous quarter (if available). In this example there is no data for the previous quarter, as the FLEXW7 variable is only available in Q2 and Q4. When variables are only present twice a year like this, you could just use the calc_pct_change() function if you wish to see the % change from the previously available data (i.e. the data from 6 months earlier).
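
The following sketch shows one way the two comparisons could be computed for a single ungrouped series. It assumes that both changes are measured from the most recent period in the data, which is an assumption about get_yoy_pct_change() rather than something confirmed here; yoy_qoq_sketch() is a hypothetical helper and the numbers are invented.

```r
# Invented series with data in Q2 and Q4 only.
series <- data.frame(
  year    = c(2016, 2016, 2017, 2017),
  quarter = c(2, 4, 2, 4),
  Outcome = c(190, 200, 210, 220)
)

# Hypothetical helper: % change of the latest period against the same
# quarter a year earlier (yoy) and the immediately preceding quarter (qoq).
yoy_qoq_sketch <- function(dat, dv, dp = 3) {
  idx    <- dat$year * 4 + dat$quarter  # consecutive quarters differ by 1
  latest <- max(idx)
  value_at <- function(i) {
    if (any(idx == i)) dat[[dv]][idx == i] else NA_real_
  }
  pct <- function(new, old) round(100 * (new - old) / old, dp)
  data.frame(
    yoy = pct(value_at(latest), value_at(latest - 4)),
    qoq = pct(value_at(latest), value_at(latest - 1))
  )
}

changes <- yoy_qoq_sketch(series, "Outcome")
print(changes)
```

Because the toy series has no Q3 2017 value, qoq is NA, matching the behaviour described for FLEXW7.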

Using the functions to explore the LFS - Examples

Zero Hour Contracts

First I am going to run several queries using the queryLFS() function. I’m going to run the query with the most detailed (lowest-level) grouping first, so that further queries can reuse the initial result without needing to process the raw data again. In the first set of queries I extract summary information on the number of members of the labour force who work on the basis of a zero-hours contract in their main job, grouped by various factors. The second set of queries requests the number of individuals in each of the 4 major economic activity categories, so that the number of zero-hour contract positions can be represented as a proportion of the active labour force.

# Number of people whose main job is on the terms of a zero-hour contract split by sex 
# and into 5-year age groups for every quarter the data is available (All variables are 
# in the varinfo database so it doesn't need year/quarters specifying).
zhdata_SA <- queryLFS(MainVar="FLEXW7", outcome="sum",grouping="SEXL;nAGE",filtering="FLEXW7 == 1")
# Number of people whose main job is on the terms of a zero-hour contract split by just sex
zhdata_S <- queryLFS(MainVar="FLEXW7", outcome="sum",grouping="SEXL",filtering="FLEXW7 == 1")
# Number of people whose main job is on the terms of a zero-hour contract split just by age
zhdata_A <- queryLFS(MainVar="FLEXW7", outcome="sum",grouping="nAGE",filtering="FLEXW7 == 1")
# Number of people whose main job is on the terms of a zero-hour contract
zhdata <- queryLFS(MainVar="FLEXW7", outcome="sum",filtering="FLEXW7 == 1")
# Number of people in each of the four major employment status categories (ILODEFR 
# variable in the LFS) split by sex and into 5-year age groups for every quarter the 
# data is available (All variables are in the varinfo database so it doesn't need 
# year/quarters specifying).
empdata_SA <- queryLFS(MainVar="ILODEFR", outcome="sum",grouping="SEXL;nAGE;ILODEFRL")
empdata_S <- queryLFS(MainVar="ILODEFR", outcome="sum",grouping="SEXL;ILODEFRL")
empdata_A <- queryLFS(MainVar="ILODEFR", outcome="sum",grouping="nAGE;ILODEFRL")
empdata <- queryLFS(MainVar="ILODEFR",outcome="sum",grouping="ILODEFRL")
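
The plotting code in this document refers to a zhPC column, whose derivation is not shown. A minimal sketch of how such a percentage could be built from the two kinds of query result follows. The toy numbers are invented, the ILODEFRL labels used here are assumptions about how the categories are named, and the economically active are taken to be those in employment plus the unemployed.

```r
# Toy stand-ins for zhdata and empdata; all numbers are invented.
zh_toy <- data.frame(
  Q_Year  = c("2017 Q2", "2017 Q4"),
  Outcome = c(900, 950)
)
emp_toy <- data.frame(
  Q_Year   = rep(c("2017 Q2", "2017 Q4"), each = 3),
  ILODEFRL = rep(c("In employment", "ILO unemployed", "Inactive"), times = 2),
  Outcome  = c(30000, 1500, 18000, 30500, 1400, 18100)
)

# Economically active = in employment + unemployed (labels are assumed).
active <- emp_toy[emp_toy$ILODEFRL %in% c("In employment", "ILO unemployed"), ]
active_total <- aggregate(active["Outcome"], by = active["Q_Year"], FUN = sum)
names(active_total)[2] <- "active"

# Join on the time point and express zero-hours workers as a percentage.
zh_toy <- merge(zh_toy, active_total, by = "Q_Year")
zh_toy$zhPC <- round(100 * zh_toy$Outcome / zh_toy$active, 2)
print(zh_toy)
```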

Overall

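# Interactive Version
library(dygraphs)  # dygraph(), dyRangeSelector(), dyOptions(), dyLegend(), %>%
library(xts)       # xts()
# Need to construct time-series objects of class xts and pass them to dygraph().
# Note: zhPC is not returned by queryLFS(); it is assumed to have been derived beforehand.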
zhc_overall_plot <- dygraph(data=xts(zhdata$zhPC,zhdata$Q_Year), 
                            main="Percentage of the economically active who work 
                            on a zero-hour contract in their main job", ylab="%") %>% 
  dyRangeSelector() %>% 
  dyOptions(fillGraph=TRUE, drawGrid=FALSE, drawPoints=TRUE, pointSize=3) %>% 
  dyLegend(width=400)
zhc_overall_plot

For this first example I have produced a static time-series plot using the ggplot2 package and an interactive version using the dygraphs package. The interactive version is more useful when viewed on a web platform, given the added functionality of hovering over points for exact values, manipulating the x-axis range, etc. However, if producing a report as a PDF then a static plot is required, and ggplot2 provides a suitable framework for these. For the rest of this example report/documentation, interactive plots will take priority. Example code is printed for these initial examples but will be omitted henceforth.

Split By Sex

Split By Age Group

Note: because a plot with this number of lines is busy, it is useful to hover over the line and/or points you are interested in. This will highlight the points at that particular time and/or the particular line selected.

Split By Sex/Age Group

There are different ways to visualise the results of queryLFS() when the outcome is grouped by more than one variable. A static example using facet_grid() from ggplot2 and an interactive example using dygraphs are included here, but perhaps a better visualisation exists for such data.
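
As a rough sketch of the static approach, the code below draws one panel per sex/age combination with facet_grid(). It assumes the ggplot2 package is installed and uses invented data in the shape of a queryLFS() result grouped by SEXL and nAGE; the period column is a hypothetical numeric time axis added for plotting.

```r
library(ggplot2)

# Invented data shaped like a result grouped by SEXL and nAGE.
toy <- expand.grid(
  SEXL = c("Male", "Female"),
  nAGE = c("16-19", "20-24"),
  step = 1:4,
  stringsAsFactors = FALSE
)
toy$period  <- 2016 + (toy$step - 1) * 0.5  # hypothetical half-yearly axis
toy$Outcome <- seq_len(nrow(toy)) * 10      # invented values

# One panel per combination: rows are sexes, columns are age bands.
p <- ggplot(toy, aes(x = period, y = Outcome)) +
  geom_line() +
  geom_point() +
  facet_grid(SEXL ~ nAGE) +
  labs(x = "Time", y = "Outcome")
print(p)
```

Faceting keeps each line on its own panel, which avoids the crowding that arises when every sex/age combination shares one set of axes.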