Loading NEMSIS Data into SAS (Huge Text Data with Weird Delimiters)
Oct 29, 2019
SAS
The Problems
NEMSIS nation data comes as text files in the 20+ GB range. Loading them is challenging for two reasons.
A Three-letter Delimiter
First, the delimiter between the fields is not a single character. Rather it is ~|~
. That means that my preferred solution in R, the Haven package, will not load the data and my default in SAS, proc import
needs to be tweaked. There is no way to have proc import
process a three-letter delimiter, but proc import saves the import code, which is a data step, into the log. SAS data step code can be tweaked to handle the three-letter delimiter.
Character Strings for Bad Data
The second issue is that SAS cannot efficiently scan through many gigabytes of data to figure out if columns are character or numeric. So, you will need to deal with errors when SAS tries to read in a text string indicating missing data in a numeric column.
My workaround for SAS guessing the wrong column type is to build a “sample full of problems” dataset and to use proc import
to write code to get that problem file to import. Once the code works on a small dataset full of problems, the same code can be run on the full data.
How did I find the problematic records? I will tell you exactly how below but here is the basic idea. The first time I ran proc import
it found some problematic records. The SAS log file told me which line/record, of the text that I was trying to import, had issues. I added those lines/records to the problem data set. I then ran import on that file and I noted the format, informat and variable type that successfully loaded the problematic data. I made the changes, which I noted when I successfully loading the problem data, to the data step running on the full data set. If that found more problems I added the bad records to the problem dataset, got the details that would fix the issue and tried again. I had to repeat the process a couple times to get all the errors.
Making the file of problems
You should be wondering how I was able to make the problem file. I had to break out an old language called sed and mix it with some basic UNIX shell scripting.
Step 0 Configure your Mac or Windows Box
If you are running SAS on a Mac with Parallels check my tutorial on configuring your Mac which is here. If you are using Windows look here.
Step 1 Open A Terminal or Git Bash
The first thing to note is that you can save the first 10 lines of a text file called thing.txt into a new file called sample.txt using the UNIX head command like this in either the Mac Terminal or in GitBash on Windows:
head thing.txt > sample.txt
Pulling line 100 from the middle of a file called thing.txt and saving the results into a new file called sample.txt can be done with sed using code like this:
sed -n '100p' thing.txt > sample.txt
The workflow
For this example, assume I am trying to import a NEMSIS file called Pub_PCRevents.txt.
The process that I used to load the data was:
- use proc import with the delimiter set to
|
Proc import code goes here
- look in the log to find the records/lines/observations that have problems
- grab the first five line from Pub_PCRevents.txt and save them into sampleOfPub_PCRevents.txt
head -n 5 Pub_PCRevents.txt > sampleOfPub_PCRevents.txt
- use sed to add any problematic records to the sample. For example, if the log noted a problem on lines 1000123 and 2000100 I would add them to the sample like this:
sed -n '1000123p' Pub_PCRevents.txt >> sampleOfPub_PCRevents.txt
sed -n '2000100' Pub_PCRevents.txt >> sampleOfPub_PCRevents.txt
Do note that when I first made the sample of the dataset I used >
to send the sample into a new file and when I am appending the problems on to that existing file I use >>
.
Once sampleOfPub_PCRevents.txt is created, run proc import
on it. The data step code that proc import
generates from the sample data has the corrected import code for any of the problem lines in sampleOfPub_PCRevents.txt. So, that code contains the fixes needed for the original code that failed importing Pub_PCRevents.txt. The only trick is to be sure to fix three things for each problem variable. The variable type needs to be tweaked (by adding a $ to indicate character data) and each problem variable needs to have its format and informat adjusted to match the code that came from the successful import of sampleOfPub_PCRevents.txt.
After the variable types, formats and informants are fixed, one part of the infile
statement needs to be adjusted. Use the option dlmstr= “|” instead of dlm = “|”.
Finally remove all the references to the variables that are holding “~”.