Categories
FSMA R

Extracting data from PDF files using R

Purpose

This post illustrates how R can be used to extract numerical information from PDF files. The technique is illustrated by extracting the Credit Risk Premium and Valuation Uncertainty parameters from the PDF file containing the instructions for the PRA’s 2021 Solvency II Quantitative Impact Study (QIS).

(Note that since this post was written, the PRA have updated the QIS instructions. This has changed the pagination of the PDF file slightly. The code below has not been updated for this.)

This post is partly to illustrate how to extract data from PDF files, and partly preparation for a later one discussing the QIS in more detail, in particular the draft redesign of the Fundamental Spread. See this post for a general introduction to the existing Fundamental Spread.

Introduction

PDF stands for Portable Document Format. This format is designed for presentation, principally to look the same on-screen as when printed out, and so some work is needed to extract data from it.

Copying and pasting tables from PDF files into Word or Excel often does not work very well and needs manual adjustment, for example using Excel’s Convert Text to Columns feature.

The R library pdftools reads PDF files and does a great deal of work behind the scenes to represent their content as text, which provides an alternative to copying from a PDF viewer.

Using R is more auditable than manually copying/pasting, because all the code used is stored for later examination (it can be thought of as a ‘recording’ of the manipulation). Being code-based, it can in principle be automated for use-cases where PDFs in the same overall format are regularly updated (this does not apply to the QIS case study, but the point holds in general).

Download and open the PDF file

The calibrations tested in the PRA QIS are included in the ‘instructions’ PDF file at: https://www.bankofengland.co.uk/-/media/boe/files/prudential-regulation/solvency-ii/solvency-ii-reform-quantitative-impact-survey/qis-instructions.pdf.

The instructions file is downloaded to the working directory using R as follows. Note that the mode must be set to "wb" on Windows to download as a binary file, to avoid file corruption issues from treating it as text.

url <- "https://www.bankofengland.co.uk/-/media/boe/files/prudential-regulation/solvency-ii/solvency-ii-reform-quantitative-impact-survey/qis-instructions.pdf"
pdf_file <- "qis-instructions.pdf"
download.file(url, pdf_file, mode = "wb")

First we use the pdftools::pdf_info function to look at some basic information:

pdf_i <- pdftools::pdf_info(pdf_file)

# Page count
pdf_i$pages
#> [1] 35

# Author, Producer, Title
pdf_i$keys
#> $Author
#> [1] "Prudential Regulation Authority, Bank of England"
#> 
#> $Producer
#> [1] "Microsoft: Print To PDF"
#> 
#> $Title
#> [1] "Review of Solvency II - QIS Instructions"

All is as expected, so we can now load the PDF file in full using pdftools::pdf_text. This returns an R character vector with one entry per page (so 35 entries since there are 35 pages as noted above). Each page is a continuous stream of text, with embedded newlines ("\n" in R).

pdf_pages <- pdftools::pdf_text(pdf_file)
str(pdf_pages)
#>  chr [1:35] "Review of Solvency II 㤼㸶 Quantitative Impact Study (QIS)\nInstructions for completing the QIS template(s)\n\nVer"| __truncated__ ....

Extract the Credit Risk Premium parameters

From looking at the file in a PDF viewer, the credit risk premium (CRP) parameters are on page 28. We use cat to view this page with newlines shown as such, rather than "\n". It can be seen that the ‘text’ version of the PDF provided by pdftools is reasonably faithful to the original, with the embedded tables laid out logically, and so it will be possible to extract the numerical values.

cat(pdf_pages[[28]])
#>       (ii) A percentage applied to the 5-year average spread on an index of the same sector
#>            and CQS as the asset.
#> 
#> 10. The percentages to be used in 9(i) and 9(ii) are as follows:
#> 
#>                                                               Scenario A                  Scenario B
#>  Percentage applied to current z-spread                            25%                        25%
#>  Percentage applied to average spread                              25%                         0%
#> 
#> 11. The 5-year average spreads to be used for the purposes of 9(ii) above are as follows:
#> 
#>                    5-year average spreads (bps)
#>                Financials            Non-Financials
#>      CQS 0           90                       79
#>      CQS 1          121                      112
#>      CQS 2          180                      159
#>      CQS 3          276                      197
#>      CQS 4          462                      379
#>      CQS 5          693                      656
#>      CQS 6          693                      656
#> 
#> 12. For the purposes of this QIS, we will be testing both a floor and a cap applied to the total
#>     CRP. The floor will be applied in Scenarios A and B and all the sensitivities tested under
#>     them. The cap will be tested both on and off in Scenario A only19. The following table
#>     sets out the floor and cap to be applied to the total CRP.
#> 
#>                  CRP floor for Scenario A and B
#>                                                                  CRP cap for Scenario A (bps)
#>                               (bps)
#>                   Financials      Non-Financials                  Financials         Non-Financials
#>      CQS 0               11                      4                     56                    42
#>      CQS 1               27                      19                   101                    86
#>      CQS 2               43                      28                   153                    124
#>      CQS 3               72                      40                   229                    166
#>      CQS 4              168                     117                   507                    403
#>      CQS 5              391                     184                  1,037                   623
#>      CQS 6              391                     184                  1,207                   793
#> 
#> 13. The following table summarises the scenarios in which the floor and cap should be
#>     applied.
#> 
#> 19 As stated above, there are a number of cells in Scenario A of the template with the explicit direction that figures
#> are to be provided on the basis of a cap being applied. Unless such explicit direction is give, please provide
#> figures on the basis that the cap is not applied.
#> 
#> 28

The text representation isn’t perfect – footnote 19 is not shown in superscript, for example – but it’s definitely usable.

To ease parsing page 28, we use strsplit to break the text up by newlines, so we can read each line individually. For example lines 6 to 8 hold the CRP percentages to apply to z-spreads in each scenario, lines 13 to 20 hold the 5-year average spreads, and lines 30-37 hold the floors and caps:

p28_lines <- strsplit(pdf_pages[[28]], "\n")[[1]] # strsplit returns a 1-item list here and we want the first item itself

# Percentages
p28_lines[6:8]
#> [1] "                                                              Scenario A                  Scenario B"
#> [2] " Percentage applied to current z-spread                            25%                        25%"   
#> [3] " Percentage applied to average spread                              25%                         0%"

# 5-year average spreads
p28_lines[13:20]
#> [1] "               Financials            Non-Financials"
#> [2] "     CQS 0           90                       79"   
#> [3] "     CQS 1          121                      112"   
#> [4] "     CQS 2          180                      159"   
#> [5] "     CQS 3          276                      197"   
#> [6] "     CQS 4          462                      379"   
#> [7] "     CQS 5          693                      656"   
#> [8] "     CQS 6          693                      656"

# Floors and caps
p28_lines[27:37]
#>  [1] "                 CRP floor for Scenario A and B"                                                    
#>  [2] "                                                                 CRP cap for Scenario A (bps)"      
#>  [3] "                              (bps)"                                                                
#>  [4] "                  Financials      Non-Financials                  Financials         Non-Financials"
#>  [5] "     CQS 0               11                      4                     56                    42"    
#>  [6] "     CQS 1               27                      19                   101                    86"    
#>  [7] "     CQS 2               43                      28                   153                    124"   
#>  [8] "     CQS 3               72                      40                   229                    166"   
#>  [9] "     CQS 4              168                     117                   507                    403"   
#> [10] "     CQS 5              391                     184                  1,037                   623"   
#> [11] "     CQS 6              391                     184                  1,207                   793"

Now we need to pull out the numbers. This is hardly worth doing for the percentages, which could just be typed in, but it establishes the general principle we’ll use for the other tables.

The idea is to use regular expressions to search for numerical values systematically. The regular expression [0-9] means ‘a digit character’ and the "+" means ‘one or more of’. Thus [0-9]+ will match a sequence of digits of any length. This is sufficient to parse the percentages, but to deal with embedded commas and decimal points in later tables, we expand it to [0-9,\\.]+, which means one or more of (1) a digit character (2) a comma and (3) a decimal point.

The decimal point is a ‘special’ character for regular expressions that matches anything, so we have to precede it with a \ character, and in turn this has to be doubled for R to recognise it as a \. The power of regular expressions makes it worthwhile to tolerate these fiddly subtleties!

This isn’t a fully general way of parsing numbers, but it suffices for the PRA QIS tables here.

We wrap this into a function parse_numbers that matches the numerical text, deals with embedded commas, converts the results to numbers while retaining the embedded decimal points, and arranges the results as a matrix.

Enclosing this regular expression in parentheses means ‘capture anything matching this expression’, and the combination of gregexpr and regmatches captures the matching expressions and discards the non-matching text.

For simplicity we add the row and column names manually rather than by parsing.

parse_numbers <- function(text_lines) {
  matches <- regmatches(text_lines, gregexpr("([0-9,\\.]+)", text_lines))
  matches <- t(simplify2array(matches)) # t to transpose the result of simplify2array
  nr <- nrow(matches) # Keep the number of rows so it can be restored later
  matches <- gsub(",", "", matches) # Remove commas
  matches <- matrix(as.numeric(matches), nrow = nr) # Convert to actual numbers rather than text representing numbers
  return(matches)
}

crp_fac <- parse_numbers(p28_lines[7:8]) / 100 # Convert to numerically correct rather than percentages
rownames(crp_fac) <- c("Current", "5yrAverage")
colnames(crp_fac) <- c("ScenarioA", "ScenarioB")
crp_fac
#>            ScenarioA ScenarioB
#> Current         0.25      0.25
#> 5yrAverage      0.25      0.00

We use parse_numbers again to pull out the 5-year average spreads, and put these in a data frame labelled with the CQSs in the first column:

crp_5yr_avg_bp <- as.data.frame(parse_numbers(p28_lines[14:20]))
crp_5yr_avg_bp[,1] <- paste0("CQS", crp_5yr_avg_bp[,1])
colnames(crp_5yr_avg_bp) <- c("CQS", "Financial", "NonFinancial")
crp_5yr_avg_bp
#>    CQS Financial NonFinancial
#> 1 CQS0        90           79
#> 2 CQS1       121          112
#> 3 CQS2       180          159
#> 4 CQS3       276          197
#> 5 CQS4       462          379
#> 6 CQS5       693          656
#> 7 CQS6       693          656

Finally we pull out the CRP floors and caps, which are laid out as a 5-column table, and split this into two separate data frames:

crp_floor_cap_bp <- as.data.frame(parse_numbers(p28_lines[31:37]))
crp_floor_cap_bp[,1] <- paste0("CQS", crp_floor_cap_bp[,1])
crp_floor_bp <- crp_floor_cap_bp[,1:3]
crp_cap_bp <- crp_floor_cap_bp[,c(1, 4:5)]

colnames(crp_floor_bp) <- colnames(crp_cap_bp) <- c("CQS", "Financial", "NonFinancial")
crp_floor_bp
#>    CQS Financial NonFinancial
#> 1 CQS0        11            4
#> 2 CQS1        27           19
#> 3 CQS2        43           28
#> 4 CQS3        72           40
#> 5 CQS4       168          117
#> 6 CQS5       391          184
#> 7 CQS6       391          184

crp_cap_bp
#>    CQS Financial NonFinancial
#> 1 CQS0        56           42
#> 2 CQS1       101           86
#> 3 CQS2       153          124
#> 4 CQS3       229          166
#> 5 CQS4       507          403
#> 6 CQS5      1037          623
#> 7 CQS6      1207          793

Extract Valuation Uncertainty parameters

The Valuation Uncertainty (VU) parameters are on page 29.

p29_lines <- strsplit(pdf_pages[[29]], "\n")[[1]]

# VU parameters
p29_lines[19:24]
#> [1] "         CQS                  IFRS Level            VU for Scenario A            VU for Scenario B"
#> [2] "                                                          (bps)                        (bps)"      
#> [3] "          All                   Level 1                    NIL                           Nil"      
#> [4] "          All                   Level 2                     7.5                         3.75"      
#> [5] "          0-3                   Level 3                     7.5                         3.75"      
#> [6] "          4-6                   Level 3                    25.0                        12.50"

We pull these out, omitting the row for IFRS Level 1 since this complicates the parsing, and starting the parse at character 30 to omit the CQS column which will otherwise confuse matters.

We then add in the CQS column, manually expand the rows to refer to CQS explicitly, and include zero values for IFRS level 1 at the top.

vu_lines <- p29_lines[22:24]
vu_lines <- unname(vapply(vu_lines, function(x) substr(x, 30, nchar(x)), character(1)))
ex_cqs_vu_bp <- as.data.frame(parse_numbers(vu_lines))
ex_cqs_vu_bp[,1] <- paste0("IFRS", ex_cqs_vu_bp[,1])
colnames(ex_cqs_vu_bp) <- c("IFRSLevel", "ScenarioA", "ScenarioB")
vu_bp <- rbind(data.frame(CQS = paste0("CQS", 0:6), data.frame(IFRSLevel = "IFRS1", ScenarioA = 0, ScenarioB = 0)),
               data.frame(CQS = paste0("CQS", 0:6), ex_cqs_vu_bp[1,]), # First row is IFRS2 all CQS
               data.frame(CQS = paste0("CQS", 0:3), ex_cqs_vu_bp[2,]), # Second row is IFRS3 CQS0-3
               data.frame(CQS = paste0("CQS", 4:6), ex_cqs_vu_bp[3,])) # Third row is IFRS3 CQS4-6
vu_bp
#>     CQS IFRSLevel ScenarioA ScenarioB
#> 1  CQS0     IFRS1       0.0      0.00
#> 2  CQS1     IFRS1       0.0      0.00
#> 3  CQS2     IFRS1       0.0      0.00
#> 4  CQS3     IFRS1       0.0      0.00
#> 5  CQS4     IFRS1       0.0      0.00
#> 6  CQS5     IFRS1       0.0      0.00
#> 7  CQS6     IFRS1       0.0      0.00
#> 8  CQS0     IFRS2       7.5      3.75
#> 9  CQS1     IFRS2       7.5      3.75
#> 10 CQS2     IFRS2       7.5      3.75
#> 11 CQS3     IFRS2       7.5      3.75
#> 12 CQS4     IFRS2       7.5      3.75
#> 13 CQS5     IFRS2       7.5      3.75
#> 14 CQS6     IFRS2       7.5      3.75
#> 15 CQS0     IFRS3       7.5      3.75
#> 16 CQS1     IFRS3       7.5      3.75
#> 17 CQS2     IFRS3       7.5      3.75
#> 18 CQS3     IFRS3       7.5      3.75
#> 19 CQS4     IFRS3      25.0     12.50
#> 20 CQS5     IFRS3      25.0     12.50
#> 21 CQS6     IFRS3      25.0     12.50

Save the parameters

Finally we save the CRP and VU parameters to RDS (R format) and CSV (for Excel) files for later analysis.

saveRDS(crp_fac, "qis21_crp_fac.rds")
saveRDS(crp_5yr_avg_bp, "qis21_crp_5yr_avg_bp.rds")
saveRDS(crp_floor_bp, "qis21_crp_floor_bp.rds")
saveRDS(crp_cap_bp, "qis21_crp_cap_bp.rds")
saveRDS(vu_bp, "qis21_vu_bp.rds")

write.csv(crp_fac, "qis21_crp_fac.csv", row.names = FALSE)
write.csv(crp_5yr_avg_bp, "qis21_crp_5yr_avg_bp.csv", row.names = FALSE)
write.csv(crp_floor_bp, "qis21_crp_floor_bp.csv", row.names = FALSE)
write.csv(crp_cap_bp, "qis21_crp_cap_bp.csv", row.names = FALSE)
write.csv(vu_bp, "qis21_vu_bp.csv", row.names = FALSE)

Summary

In this post, we showed how the R package pdftools and R’s built-in regular expressions capability can be used to pull numerical information out of PDF files.