[SOLVED] How do I pull specific fields from this XML file using R or Python?

Issue

I am trying to convert this XML file, from this US government public data source, into a clean relational or tabular format that holds the key information. The problem I am running into is that several fields (such as Last Name and Location) appear to be stored within the same node. What is the best way to do some basic parsing of this file?

The screenshot from the XML guide (not reproduced here) shows the part I'm struggling to pull apart: the information inside the orange box.

Ideally a result would look something like this:

LAST NAME   FIRST NAME   MID NAME
Doe         John         Bob

Solution

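If you are using R, it is straightforward to get these fields with the xml2 or rvest packages. As the code below shows, the names are stored as attributes of each info node, so they can be read one attribute at a time. For example, using the first XML file in the linked zip folder: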

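library(rvest)

# path_to_xml: path to the XML file extracted from the zip folder
# read_html() uses the HTML parser, which lowercases element and attribute
# names, so the XPath and attribute names below are written in lowercase
entries <- read_html(path_to_xml) %>%
  html_nodes(xpath = "//info")

# Each name is an attribute of an info node; collect one column per attribute
result <- data.frame(Last_Name = entries %>% html_attr("lastnm"),
                     First_Name = entries %>% html_attr("firstnm"),
                     Mid_Name = entries %>% html_attr("midnm"))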

head(result)
#>   Last_Name First_Name Mid_Name
#> 1    FISHER     ANDREW   MUNSON
#> 2 BACHARACH       ALAN   MARTIN
#> 3     GRAFF    MICHAEL  RAYMOND
#> 4      KAST    WILLIAM    ALLEN
#> 5   McMahan     Robert  Michael
#> 6   JOHNSON       JOHN        C

Created on 2022-07-09 by the reprex package (v2.0.1)
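
Since the question also asks about Python, here is a rough equivalent that uses only the standard library's xml.etree.ElementTree. It is a sketch, not part of the original answer. It assumes what the R code above implies: each person is an element named info (in any letter case) carrying lastnm, firstnm and midnm attributes. The function name extract_names, the SAMPLE document and its Indvls/Indvl wrapper elements are made up for illustration, and the sample values are taken from the desired output in the question.

```python
import io
import xml.etree.ElementTree as ET


def extract_names(source):
    """Return one dict per info element, holding its three name attributes.

    `source` may be a file path or a binary file-like object.
    """
    rows = []
    # iterparse streams the document instead of loading it all at once
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Drop any "{namespace}" prefix and ignore letter case in the tag name
        tag = elem.tag.rsplit("}", 1)[-1].lower()
        if tag == "info":
            # Lowercase the attribute names so lookups are case-insensitive
            attrs = {key.lower(): value for key, value in elem.attrib.items()}
            rows.append({
                "Last_Name": attrs.get("lastnm"),
                "First_Name": attrs.get("firstnm"),
                "Mid_Name": attrs.get("midnm"),  # None when the attribute is absent
            })
        # Release the element's contents once it has been processed
        elem.clear()
    return rows


# Hypothetical miniature document; the real file's layout may differ
SAMPLE = b"""<Indvls>
  <Indvl><Info lastNm="Doe" firstNm="John" midNm="Bob"/></Indvl>
  <Indvl><Info lastNm="Roe" firstNm="Jane"/></Indvl>
</Indvls>"""

rows = extract_names(io.BytesIO(SAMPLE))
for row in rows:
    print(row["Last_Name"], row["First_Name"], row["Mid_Name"])
```

To run this against the real file, pass its path instead of the BytesIO object. The resulting list of dicts can then be written out with csv.DictWriter or handed to pandas.DataFrame if a table is needed.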

Answered By – Allan Cameron

Answer Checked By – Marilyn (BugsFixing Volunteer)
