The PAGC postal address geocoder: BUILD PHASE PRODUCTS

7. BUILD PHASE PRODUCTS

When the build phase completes, it produces a number of files in the same directory as the reference shapeset. These are:

__db.001
__db.002
REFERENCE_SHAPESET_NAME.err
REFERENCE_SHAPESET_NAME.ix0
REFERENCE_SHAPESET_NAME.ix1
REFERENCE_SHAPESET_NAME.ix2
REFERENCE_SHAPESET_NAME.ix3
REFERENCE_SHAPESET_NAME.ix4
REFERENCE_SHAPESET_NAME.pgc
REFERENCE_SHAPESET_NAME.pgx
REFERENCE_SHAPESET_NAME.sts

The first two files (__db.*) are used by Berkeley DB to maintain its environment. These files may not be produced by Windows versions of PAGC. The next file (REFERENCE_SHAPESET_NAME.err) logs errors encountered during the build phase. These errors may range from serious to innocuous. The next five files (REFERENCE_SHAPESET_NAME.ix*) are files in the Berkeley DB format that index the main data file, REFERENCE_SHAPESET_NAME.pgx. The file REFERENCE_SHAPESET_NAME.pgc is a short text file that contains data PAGC needs to carry over from the build phase to the match phase. The last file (REFERENCE_SHAPESET_NAME.sts) is the statistics file, a text file that is produced only if the optional -z flag is used (See The Statistics File).

7.1 The PGX File

The standardized data is stored in the pgx file. The data corresponding to each detected attribute is placed in a field terminated by a comma. Each record starts with a pipe (|) and ends with a pipe. The record is stored in a Berkeley DB b-tree using the unstandardized reference table's row/shape number as a key. The file is named REFERENCE_SHAPESET_NAME.pgx and may be viewed when the build is complete by using the -dREFERENCE_SHAPESET_NAME -i-1 option. Missing or blank civic numbers are denoted by a -1. Other blank or missing values are represented only by the terminating comma. Since the standardization may include attributes not present in the unstandardized schema, fields are reserved for them.
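
The record layout described above can be sketched in a few lines of Python. This is only an illustration, not PAGC code: the function name parse_pgx_record and the sample record are hypothetical, and the sketch assumes that field values never contain a comma or a pipe.

```python
from typing import List, Optional


def parse_pgx_record(record: str) -> List[Optional[str]]:
    """Split one pipe-delimited record into its comma-terminated fields."""
    if len(record) < 2 or not (record.startswith("|") and record.endswith("|")):
        raise ValueError("record must start and end with a pipe")
    body = record[1:-1]
    if body and not body.endswith(","):
        raise ValueError("every field must be terminated by a comma")
    # Every field ends with a comma, so the piece after the last comma is empty.
    pieces = body.split(",")[:-1]
    # A blank field is represented only by its terminating comma.
    # "-1" is returned unchanged: it marks a missing civic number, and only
    # the caller knows which positions hold civic numbers.
    return [piece if piece != "" else None for piece in pieces]


# Hypothetical record, made up for illustration only.
sample = "|-1,-1,-1,-1,,MAIN,STREET,,,98225,|"
fields = parse_pgx_record(sample)
print(fields)
```

Keeping blank fields as None preserves the field positions, which matters because fields are reserved even for attributes that the unstandardized schema lacks.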

7.2 The Index Files

The standardized data is also put into four indices. The indices associate a key with the unstandardized row number. The only MACRO (See MICRO/MACRO) value used in the keys is the POSTAL attribute, if present. The HOUSE attribute is not included in the indexing either. The full feature index (REFERENCE_SHAPESET_NAME.ix0) includes all the other MICRO attributes plus the POSTAL. The streetname index (REFERENCE_SHAPESET_NAME.ix1) takes the STREET attribute only as a key, and then STREET plus POSTAL, if POSTAL is present. The soundex index (REFERENCE_SHAPESET_NAME.ix2) concatenates the soundex value of each word in the streetname to make a key, but does not include the POSTAL attribute. The three foregoing indices are Berkeley DB b-tree files.

The fourth index is the edit distance index. It is a Berkeley memory pool file, denoted REFERENCE_SHAPESET_NAME.ix3, and is a pointerless trie. Each STREET value and each STREET value plus POSTAL value is inserted into the trie. It is basically a recognition device, and no record numbers are associated with its insertions. When searched with a candidate key, it produces all keys within an edit distance of 2.
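
To make the "within an edit distance of 2" behaviour concrete, here is a minimal brute-force sketch. It is not the pointerless trie that PAGC uses, and it assumes the plain Levenshtein definition of edit distance (insertions, deletions and substitutions each cost 1), which the text above does not confirm. The names edit_distance and keys_within, and the sample keys, are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def keys_within(candidate: str, keys, limit: int = 2):
    """Return every stored key within `limit` edits of the candidate."""
    return sorted(key for key in keys if edit_distance(candidate, key) <= limit)


# Hypothetical STREET keys, for illustration only.
stored_keys = {"MAIN", "MAINE", "MARINE", "MAPLE", "STATE"}
print(keys_within("MAIN", stored_keys))  # ['MAIN', 'MAINE', 'MARINE']
```

Like the index described above, the sketch returns keys only; no record numbers are attached to them.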

These files are used in the match phase to produce candidate reference records to match against a user address. They are searched in the following order:

The full feature index is searched first, looking for an exact match on the full feature name (street name, postal code and street name modifiers).
The streetname index is searched next for exact matches on just the street name plus postal.
The approximate (edit distance) index is searched for streetname-postal matches within an edit distance of two. If not already retrieved, each new candidate is scored (using the edit distance on the streetname and postal if there is no exact match) and added to the list.
The soundex index is searched for an exact match on the soundex code for each individual word in the streetname.

A fifth file, denoted REFERENCE_SHAPESET_NAME.ix4, is also created and populated during the build phase. For each reference arc, regardless of whether or not its attributes are standardized and included in the pgx file, two index records are created. Each indexes one of the endpoints of the arc. This index is used in the matching phase to find blocks that may correspond to missing address ranges.
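
The idea of two index records per arc, one for each endpoint, can be illustrated with an in-memory dictionary. The on-disk key format of the ix4 file is not described here, so the coordinate-tuple keys, the function name build_endpoint_index and the sample arcs below are all assumptions made for the sketch.

```python
from collections import defaultdict


def build_endpoint_index(arcs):
    """Map each arc endpoint to the row numbers of the arcs that end there."""
    index = defaultdict(list)
    for row, points in arcs.items():
        # Two index records per arc: one for the first vertex, one for the last.
        for endpoint in (points[0], points[-1]):
            index[endpoint].append(row)
    return index


# Hypothetical arcs keyed by row number; each is a list of (x, y) vertices.
arcs = {
    0: [(0.0, 0.0), (1.0, 0.0)],
    1: [(1.0, 0.0), (1.5, 0.5), (2.0, 1.0)],
    2: [(5.0, 5.0), (6.0, 5.0)],
}
endpoint_index = build_endpoint_index(arcs)
print(endpoint_index[(1.0, 0.0)])  # [0, 1]
```

A lookup on a shared endpoint returns the arcs that meet there, which is the kind of adjacency a search for neighbouring blocks would need.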

7.3 Recovering the Build State.

When the build is complete, certain information is saved in the REFERENCE_SHAPESET_NAME.pgc file to be transferred from the build phase to the match phase. If a file with the same path name already exists, it will be overwritten.

The PGC file

The first line of the file is a signature, PGC. The second line is LITTLE if the architecture is little endian and BIG if not. The next line holds two integers: the first gives the size of an integer, in machine-dependent size_t units, and the second gives the number of attributes to be used in the matching. There then follows a series of records, one for each of the enumerated attributes, giving the output symbol (postal attribute), the comparison type (See Comparison Types), the two matching weights, and the field numbers for the associated fields in the standardized and unstandardized records, up to a maximum of four fields, with a -1 denoting no field.

The following is the contents of a pgc file using a Statscan schema with the default matching weights, on a little endian computer with a 4-byte integer. There are 7 attributes, the first being HOUSE (1) with a comparison type of 8 (an internal representation of the NUMBER_INTERVAL_LEFT_RIGHT comparison type), an m of .999 and a u of .05, and 4 pairs of fields for its data. The standardized fields are numbered 0, 1, 2 and 3, and the unstandardized fields are numbered 14, 15, 16 and 17. The other six attributes are then represented in a similar fashion.


PGC
LITTLE
4 7 
1 8 0.999000 0.050000 0 14 1 15 2 16 3 17 
2 1 0.800000 0.100000 4 13 -1 -1 -1 -1 -1 -1 
3 0 0.700000 0.100000 5 -1 -1 -1 -1 -1 -1 -1 
4 0 0.700000 0.100000 6 -1 -1 -1 -1 -1 -1 -1 
5 1 0.900000 0.010000 7 11 -1 -1 -1 -1 -1 -1 
6 1 0.850000 0.100000 8 12 -1 -1 -1 -1 -1 -1 
7 0 0.850000 0.100000 9 -1 -1 -1 -1 -1 -1 -1
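
The layout of the listing above can be checked with a short Python reader. This is a sketch based only on the description in this section, not the parser PAGC itself uses. The names AttributeSpec and parse_pgc are hypothetical, and the sketch assumes that the eight trailing integers of each record are four (standardized, unstandardized) field pairs, as the HOUSE record suggests.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AttributeSpec:
    symbol: int
    comparison_type: int
    m_weight: float
    u_weight: float
    # (standardized field, unstandardized field); None stands for -1 (no field).
    field_pairs: List[Tuple[Optional[int], Optional[int]]]


def parse_pgc(text: str):
    lines = [line.split() for line in text.splitlines() if line.strip()]
    if not lines or lines[0] != ["PGC"]:
        raise ValueError("missing PGC signature")
    byte_order = lines[1][0]
    if byte_order not in ("LITTLE", "BIG"):
        raise ValueError("second line must be LITTLE or BIG")
    int_size, attribute_count = int(lines[2][0]), int(lines[2][1])
    specs = []
    for tokens in lines[3:3 + attribute_count]:
        numbers = [int(token) for token in tokens[4:12]]
        pairs = []
        for std, unstd in zip(numbers[0::2], numbers[1::2]):
            if std == -1 and unstd == -1:
                continue  # unused slot
            pairs.append((None if std == -1 else std,
                          None if unstd == -1 else unstd))
        specs.append(AttributeSpec(int(tokens[0]), int(tokens[1]),
                                   float(tokens[2]), float(tokens[3]), pairs))
    if len(specs) != attribute_count:
        raise ValueError("fewer attribute records than declared")
    return byte_order, int_size, specs


SAMPLE = """PGC
LITTLE
4 7
1 8 0.999000 0.050000 0 14 1 15 2 16 3 17
2 1 0.800000 0.100000 4 13 -1 -1 -1 -1 -1 -1
3 0 0.700000 0.100000 5 -1 -1 -1 -1 -1 -1 -1
4 0 0.700000 0.100000 6 -1 -1 -1 -1 -1 -1 -1
5 1 0.900000 0.010000 7 11 -1 -1 -1 -1 -1 -1
6 1 0.850000 0.100000 8 12 -1 -1 -1 -1 -1 -1
7 0 0.850000 0.100000 9 -1 -1 -1 -1 -1 -1 -1
"""
byte_order, int_size, specs = parse_pgc(SAMPLE)
print(byte_order, int_size, len(specs), specs[0].field_pairs)
```

A pair such as (5, None) means the attribute has a standardized field but no counterpart in the unstandardized table.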

Rebuilding.

All of the build-created files (See Build Phase Products) should be deleted if the reference data needs to be rebuilt. Because Berkeley DB's environment files must also be deleted for rebuilding, and because more than one reference may use the same environment, it is recommended that there be only one reference shapeset per directory, unless you are relatively certain that you will not need to rebuild.

To clean a directory of build-created files:


rm -f __db.* *.ix* *.err *.pg*

7.4 The Build Error Log

The build phase error file takes the same name as the reference attribute table, with the extension ".err", and is placed in the same directory as the reference shapeset. It records the standardization errors encountered during the build. These are logged to assist in the revision (if required) of either the reference attribute table or the standardization files (See Standardization Files).

The bypassing of records that are not considered geocodable (i.e., they don't have an address range on either blockface) is not recorded in the error log.

Build Log Errors

No Alpha MICRO Standardization: No standardization of ADDRESS for row N: skipping!. ADDRESS is the unstandardized MICRO portion of the address in record N of the REFERENCE_SHAPESET_NAME xbase table.
RR Error: Unsupported rural route address: ADDRESS. The string ADDRESS is the MICRO portion of a standardization candidate for a standardized reference record. It contains a rural route (RR), which PAGC cannot handle. The appearance of this error does not necessarily mean that the program failed to standardize the address.
No Schema-conformance: No schema-conforming stz for row M: Using stz N: ADDRESS. The program was unable to find a schema-conforming standardization of the MICRO portion of the unstandardized address data in row M of the REFERENCE_SHAPESET_NAME reference attribute table, and used standardization N instead.
HOUSE MACRO Mismatch: SIDE house addresses but no SIDE macro for row N. SIDE is either "left" or "right", and N is the row of the REFERENCE_SHAPESET_NAME reference attribute table. It indicates that the MACRO fields for the left/right blockface were missing, even though there was an address range for that blockface.
No Alpha MACRO Standardization: Skip row N: No stz for SIDE macro ADDRESS. N is the row of the REFERENCE_SHAPESET_NAME reference attribute table. SIDE is either "left" or "right". The blockface MACRO data could not be standardized.

In addition to these errors, more general errors are also recorded if they occur after the log has been opened.

An Example Build Error Log

As an example, consider the contents of the file Whatcom.err, produced by a build of a TIGER/Line format shapeset of Whatcom County. It consists of three errors:


Right house addresses but no right macro for row 3051

No schema-conforming stz for row 14089:
E,19 Crst

Using stz 0:


Prefix Direction: EAST
Street Name:      19
Suffix Type:      CRESCENT


Left house addresses but no left macro for row 15532

These errors may be interpreted as follows:

Right house addresses but no right macro for row 3051. This error indicates that row 3051 shows an address range on the right blockface, but lacks a corresponding right zip. This is an error in the reference attribute table. If considered serious enough, one could find out what that zip code should be, edit the table, and enter the correct zip.
No schema-conforming stz for row 14089. This indicates another error in the reference. The commas indicate field divisions; note the lack of one between the 19 and the Crst. The Crst has been incorrectly included by the reference in the FENAME field rather than the FTYPE field. The standardizer can't find a correct standardization that corresponds to the reference and substitutes its own, which in this case is the correct one.
Left house addresses but no left macro for row 15532. This is an error similar to the one in row 3051, but with the zip code omitted on the left side.
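
For a large log it can help to group the affected rows by error type before editing the reference table. The following Python sketch does this for the two message forms that appear in the example above. The patterns are derived from that example only, the other messages listed under Build Log Errors are not handled, and the name rows_by_error is hypothetical.

```python
import re

# Patterns taken from the example log lines above; other messages are ignored.
PATTERNS = {
    "house_macro_mismatch": re.compile(
        r"^(?:Left|Right) house addresses but no (?:left|right) macro for row (\d+)"),
    "no_schema_conformance": re.compile(
        r"^No schema-conforming stz for row (\d+)"),
}


def rows_by_error(log_text: str):
    """Return the row numbers reported for each recognised error type."""
    found = {name: [] for name in PATTERNS}
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            match = pattern.match(line.strip())
            if match:
                found[name].append(int(match.group(1)))
                break
    return found


SAMPLE_LOG = """\
Right house addresses but no right macro for row 3051

No schema-conforming stz for row 14089:
E,19 Crst

Using stz 0:

Prefix Direction: EAST
Street Name:      19
Suffix Type:      CRESCENT

Left house addresses but no left macro for row 15532
"""
summary = rows_by_error(SAMPLE_LOG)
print(summary)
```

For the sample log this groups rows 3051 and 15532 under the missing-macro error and row 14089 under the schema-conformance error.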

7.5 The Statistics File

If the -z flag is specified with the -b flag, a file is produced giving the hit frequency for each build rule applied to a standardization candidate, and the frequency with which it was chosen as the best standardization. These statistics may prove useful in creating or reweighting rules for the reference locale. The format of the statistics report is similar to that given for the standardization test (See Standardization Test).

The following are two examples of the statistics from the build of whatcom.dbf:


Rule 4 is of type 2 (ARC)
: Input : |1 (WORD)||2 (TYPE)|
Output: |5 (STREET)||6 (SUFTYP)|
rank 13 ( 0.825000): hit frequency: 0.118579, best frequency: 0.811918
9075 hits out of 76531, best 9061 out of 11160

This entry is interpreted as follows:

first line: The first line gives the rule number (4), followed by the number (2) and name (ARC) of the rule type (See Rule Types).
second and third lines: These lines show the rule's input and output. Each item is a number followed by its name in parentheses.
fourth line: The fourth line gives the rank (13), a number between 0 and 17, and then the value at which this rule is applied in this context (0.825000). The hit frequency (0.118579) is the number of times this rule was tested against the input, as a fraction of the total tests against input, and the best frequency (0.811918) is the fraction of times this rule was selected as best for a reference record.
fifth line: The fifth line gives the counts from which the hit and best frequencies in the fourth line were calculated.
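
The two frequencies can be recomputed from the counts on the last line of an entry. The short Python sketch below does so for the entry above. The function name frequencies is hypothetical, and the sketch assumes the counts line always lists its four integers in the order shown.

```python
import re


def frequencies(counts_line: str):
    """Return (hit frequency, best frequency) from a counts line."""
    hits, tests, best, best_total = map(int, re.findall(r"\d+", counts_line))
    return hits / tests, best / best_total


hit_frequency, best_frequency = frequencies(
    "9075 hits out of 76531, best 9061 out of 11160")
print(f"{hit_frequency:.6f} {best_frequency:.6f}")  # 0.118579 0.811918
```

Both results agree with the figures printed in the entry, which confirms that the frequencies are fractions of the totals rather than percentages.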

This rule, then, is the one used to standardize 81% of Whatcom's reference records. A second rule is responsible for another 11%:


Rule 694 is of type 2 (ARC)
: Input : |22 (DIRECT)||1 (WORD)||2 (TYPE)|
Output: |2 (PREDIR)||5 (STREET)||6 (SUFTYP)|
rank 12 ( 0.800000): hit frequency: 0.018581, best frequency: 0.111828
1422 hits out of 76531, best 1248 out of 11160
