Daniel W. Goldberg
Department of Computer Science
GIS Research Laboratory
University of Southern California
Kaprielian Hall (KAP) Room 444
Los Angeles, CA 90089-0255
daniel.goldberg@usc.edu
June 1, 2006
Even in the very first discussions of the subject [111], there was substantial confusion as to what geocoding meant, and the authors identified it as a semantic problem. The paper gives very good descriptions of the various meanings of geocoding and the types of geocodes, which are still in use today. The authors distinguish between systems that geographically identify (provide codes/names for places, without specifying boundaries) and those that geographically define (provide codes for regions, and specify the boundaries) (from [99]). Relatedly, the paper discusses the fundamental differences between nominal (the code names a feature, with no spatial relations), ordinal (the codes name features and define spatial relations among them, as in a hierarchical system), and cardinal (codes that define absolute positions in geographic units) geocodes (this is actually from [104]). This very early paper about the problems encountered when trying to develop national geocoding systems determines that the main inhibiting factor is the lack of standard data sets or codes used throughout the country. Each locality (basically each county) had its own system for geocoding, as did each federal bureau and department, and urban and specialized geocoding systems were far ahead of the national system. This is exemplified by the 4 dominant application areas for geocoding that are presented: indexing and tabulation, logistics, spatial distributions, and land use, each of which has its own systems for geocoding and its own datasets used as the basis for codes. Thus, the authors decide that the goal of the national geocoding system is to convert between these various geocoding systems, not replace them, since so much money has been invested in creating them.
In another very early article about urban geocoding [31], the term is still vaguely defined, with confusion between the coding (naming) part and the geo (assigning a geometry) part (again from [99]). This paper agrees that geocodes can be either simple codes with no coordinates or coordinates. The authors define geocoding as 1) determining a geographic unit, and 2) assigning it. The nominal, ordinal, and cardinal types of geocodes (again from [104]) are recited, and address geocoding is called a conversion problem (between the nominal code for the address and the cardinal code for the point). The authors recognize that the term geocoding is becoming a combination of the two, geocoding and geodefining, and foresee the emergence of more tools doing the same. The authors also recognize, like [111], that national geocoding will suffer from compatibility problems. A very nice history of the early geocoding systems is given, describing the strengths of each and how the weaknesses prompted the creation of the next generation, leading ultimately away from the block face representation used in the 1970 Census to the segment based representation (DIME) that is in use today (turned into TIGER after the paper was published). The DIME encodes a flow network (street segments) as well as the boundaries of areal units (defined by the segments, i.e. blocks), and allows for block and node chaining to detect errors. The article concludes with a discussion of how using parcels would greatly increase the accuracy of geocoding, as would a tiering system of increasing levels of accuracy. A very similar history of the DIME and TIGER and their relationship to the early census mapping for the US, as well as the UK equivalent, is described in [82].
[91] reports on the first 15 years of geocoding principles and practice at the US Census Bureau. Geocoding is defined as ``the assignment of geographic codes to input records'', emphasizing the use of codes for addresses instead of free form word representations (like [76]) that work well for the USPS. The fundamental components of the geocoder are defined as 1) the geographic reference file, and 2) the computer programs that A) identify the parts of the address, and B) link these components to the standardized components in the geographic reference file. The reference files contain the street names, address ranges, and the code for the smallest geographic area that is identified in the system. The computer program is described as ``blocking and weighting''. The authors first discuss blocking (making the set of candidates to check as small as possible, both to improve speed and because the search space is enormous): blocking on zip and Soundex (in 1967) yields 14 candidates per block, and adding the address range brings this down to 2 per block. Weighting is discussed next: 1) naively - each component counts an equal amount and the number of components that match determines the matching record (used in 1963), and 2) probabilistically - a great discussion of how information theory, conditional probability, and decision theory can be used to determine the best matches in conjunction with a measure of the risk that one is willing to take (a cost function). This potentially leads to Type I (false negative) and Type II (false positive) errors, however, and the cost function determines the threshold of confidence that must be achieved in a match. A great detailed history of the advances made in data and technology in the first 15 years at the Census is given. 1963: Economic Census System - the first automatic geographic coding of address data, address reference files (ARF) for cities > 25,000 people, and separate building reference files (BRF).
Parsing was done with an ``unscrambler'' and was token based, reading left to right. The parsed addresses were standardized and matched to the ARF and BRF subject to the constraint that at least two fields had to match. The Census used the naive summation of points for each component that matched to determine the best match, and ties were resolved manually. 1967: the BRF was merged into the ARF, which now included zip codes. The Census had ARFs for all cities > 2500 people, and used the same parsing together with the probabilistic methods for matching. The Census got 70% of 4,000,000 addresses matched, and reduced Type I errors by 90% and Type II errors by 80%. A test census was run on New Haven in 1968, where the Address Coding Guide and DIME (intersection coding, block pairs, and directions) reference files were developed, as well as the ADMATCH program (unscrambler and linker - input address -> reference file record). The unscrambler had support for variant name spellings, and the linker allowed for variability in the number of fields that had to match, allowing for the control of error. However, ADMATCH was meant for local use, and the Economic Census Geocoder beat it when used at a national scale in 1975 (81% with no error for ADMATCH, 92% with 1% error for the Census). 1972 and 1977: modifications to the weights were made, with more risk taken in 1972 and less in 1977, and the string comparator from the USPS was introduced in 1976. 1970: the block face reference files (Address Coding Guide) were developed, and required no probabilistic scheme because the address files had the same codes. 1980: only deterministic coding was done, and errors were recycled until they matched in order to improve the accuracy of the underlying data set. 1987 (when the paper was written): the Census chooses not to use probabilistic geocoding methods because of the quantity and sources of records and the requirement of highly accurate geocoding. In sum, the authors found that most geocoding errors were due to inadequacies of the reference data.
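To make the blocking step above concrete, the following is a minimal sketch of blocking on zip and Soundex and then narrowing by address range. It is not the Census Bureau's code: the record layout (``street'', ``zip'', ``low'', ``high'', ``number'') and the function names are assumptions made for illustration, and the Soundex shown is the standard American variant.

```python
from collections import defaultdict

# Standard American Soundex digit groups.
_DIGITS = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _DIGITS[_ch] = _digit


def soundex(name):
    """Return the 4-character American Soundex code of a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _DIGITS.get(letters[0], "")
    for ch in letters[1:]:
        digit = _DIGITS.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]


def block_key(record):
    """Blocking key: zip code plus Soundex of the street name (hypothetical layout)."""
    return (record["zip"], soundex(record["street"]))


def build_blocks(reference):
    """Group reference records by blocking key so that only one block is searched."""
    blocks = defaultdict(list)
    for rec in reference:
        blocks[block_key(rec)].append(rec)
    return blocks


def candidates(blocks, query):
    """Candidates in the query's block whose address range contains the number."""
    in_block = blocks.get(block_key(query), [])
    return [r for r in in_block if r["low"] <= query["number"] <= r["high"]]


# Made-up reference data; the misspelled query still lands in the right block.
reference = [
    {"street": "MAIN", "zip": "90001", "low": 100, "high": 198},
    {"street": "MAINE", "zip": "90001", "low": 200, "high": 298},
    {"street": "OAK", "zip": "90001", "low": 100, "high": 198},
    {"street": "MAIN", "zip": "90002", "low": 100, "high": 198},
]
blocks = build_blocks(reference)
query = {"street": "MAYN", "zip": "90001", "number": 150}
print(candidates(blocks, query))
```

In this toy data the zip and Soundex key leaves two candidates (MAIN and MAINE in 90001), and the address range reduces them to one, mirroring on a small scale the reduction from 14 to 2 candidates per block reported above.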
As early as 1995 the foundations of geocoding were known. In his seminal work [30], the author outlines the entire geocoding process as what he terms ``address matching'' (highlighting that there was confusion as to what geocoding actually was even at the beginning). A detailed history of geocoding from its roots in the 1970 US Census is introduced along with the development of the TIGER files. He predicted the three factors that would drive geocoding: increasing power and decreasing cost of machines (still allowing for more and more computationally expensive approaches), sophistication of GIS software (now with online maps, and interoperable services), and availability of TIGER on CD ROMs (as well as other datasets that are now freely available). He defines the process as first matching a reference segment (perfect, partial, potential, and unmatched), then determining parity (side of street), then interpolation, then offset. Check fields (city, zip) are used to resolve ambiguities. Even in this first work, all geocoding is described as an approximation, and in particular the error associated with the address range method had been identified. Additionally, it was recognized that reference data have different levels of accuracy (TIGER, USPS, local), and that standardization of addresses was essential. The match rate and error rate were defined, and application tradeoffs between the two were highlighted. Systematic bias could be introduced by the processes and reference data used. He defined transformation tools as those that could convert addresses from one format to another for better matching, such as Soundex. Procedural tools, on the other hand, are criteria relaxation, scoring tables, and statistical probability. This description was paraphrased in [3].
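The parity, interpolation, and offset steps described above can be sketched as follows. This is a minimal illustration of the address range method under stated assumptions, not the procedure from [30]: the segment layout (``start'', ``end'', ``left_range'', ``right_range''), planar coordinates, ascending ranges, and the default offset distance are all hypothetical.

```python
import math


def interpolate_address(number, segment, offset=10.0):
    """Place a house number along a street segment with linear interpolation.

    The side of the street is chosen by parity, the position along the
    centerline by the number's proportion within the address range, and the
    point is then pushed perpendicular to the centerline by `offset` units.
    Returns (x, y, side), or None when the number fits neither range.
    """
    (x1, y1), (x2, y2) = segment["start"], segment["end"]
    for side, (lo, hi) in (("left", segment["left_range"]),
                           ("right", segment["right_range"])):
        # Parity check: the number must share odd/even-ness with the range.
        if number % 2 == lo % 2 and lo <= number <= hi:
            break
    else:
        return None

    # Proportion of the way along the segment (midpoint for a one-number range).
    t = 0.5 if hi == lo else (number - lo) / (hi - lo)

    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    nx, ny = -dy / length, dx / length  # unit normal pointing to the left
    sign = 1.0 if side == "left" else -1.0
    return (x1 + t * dx + sign * offset * nx,
            y1 + t * dy + sign * offset * ny,
            side)


# A made-up eastbound segment: even numbers on the left, odd on the right.
segment = {"start": (0.0, 0.0), "end": (100.0, 0.0),
           "left_range": (100, 198), "right_range": (101, 199)}
print(interpolate_address(149, segment))
```

The sketch makes the approximation noted above visible: the result depends only on the range endpoints, so any lot that is not evenly spaced along the segment is misplaced, and a theoretical range wider than the real one shifts every address on the segment.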
[89] describes how geocoding was a part of the Drug Market Analysis Program (DMAP) in Pittsburgh, which aimed to combat the crack epidemic by predictively identifying hotspots using neural networks. In this work the authors outline the interpolation based method (the address range method), as well as the point based method, where the points are derived from parcel boundaries and building footprints. The authors note that the point based method is more expensive, but more accurate. However, the authors consider geocoding to be ``address matching'', i.e. addresses are matched against an ``address coverage'' containing all points for an area, so if an address is missing, it is not geocodable. The authors recognize that multiple addresses can reside at the same location, which creates one-to-many mappings. Non-matches are attributed to inaccurate ``address coverages'' (meaning that the reference file of geocoded points is wrong) or to inconsistencies in address spellings. At this time, there were few probabilistic methods for matching addresses, so the authors state that matches needed to be picked from candidates by hand, which was time consuming. Instead, the authors came up with their own ``geocoding algorithm'' that corrected inconsistencies and picked candidates based on a set of rules.
[57] compares the address matching / geocoding capabilities of ArcView 3.0 and MapInfo 4.0. It provides a nice, straightforward background to the geocoding process, and difficulties are explained nicely with simple examples. The geocoding process is deemed dependent on 3 components: 1) the geocoding engine, 2) the base maps (reference data), and 3) the address list. It argues for correct definitions of ``geocoding'' versus ``address matching''. Geocoding is defined as ``assign[ing] an absolute location, through x and y coordinates, to a geographic feature referenced by a relative location..'', while address matching ``assigns an absolute location (x and y coordinates) to each address in an address database file by interpolating specific address locations against a geocoded street theme coded with address ranges'', and it further points out that ``street networks that serve as the interpolation bases are already geographically referenced or geocoded''. This is a very confused description that pins the interpolation process as the address matching part. The history of TIGER and the address range method is described, and the reasons for (city style address ranges, overlapping ranges, gaps in streets) and explanations of enhanced derivatives of TIGER like GDT are listed. The authors also note that geocoding in rural areas is hard because of rural route addresses (non-city style), and that the adoption of E911 is forcing localities to establish city style addressing. The paper also presents how dropbacks are used, how different levels can be used (exact match or boundary - zips), how the attributes can be relaxed and matches determined based on levels of ``stringency'' (a combination of spelling sensitivity, minimum match score, minimum score, and abbreviation replacement), and how the iterative process picks candidates for non matches.
In later work [58], an overview of 4 different geocoding engines is presented, and the use of the Coding Accuracy Support System (CASS) from the USPS as the basis for standardized addresses by the USPS, commercial geocoders, and the mailing industry is described (the USPS to reduce undeliverable mail, geocoders to improve accuracy, and the mailing industry to get discounts).
[117] differentiates between address matching (``the process of adding location information to a database containing business, survey, or administrative record'') and geocoding (``the process by which a point locations defined by street address or other address information, to a map'', and ``It is the computer equivalent of pushing pins into a street map on wall''), as the authors attempt to describe the importance of address data and geocoding applications in Turkey through a prototype implementation of an Address Information System (AIS) for Trabzon City, Pelitli Municipality, Turkey. The lack of an address standard in Turkey is argued to create a high degree of confusion among address based applications.
[84] describes the process of geocoding in relation to the Internet and discusses its value in terms of the amount of money that it saves. Geocoding is defined as standardization (correcting the address) and geocoding (finding and interpolating from the reference db). The authors also talk about how conflation can be used to improve the reference db to get more accurate results in both standardizing and geocoding.
[22] describes how TIGER can be used nationally, but notes that the accuracy depends on the reference data used, that using parcels would provide the most accurate results, and that using zip codes and census tracts is less accurate. The authors also describe how a point-to-polygon operation can be used to go from points to larger areas like zip codes, and emphasize that address standardization and address cleaning are critical.
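Going from a geocoded point to a larger area can be illustrated with a standard ray casting point-in-polygon test. This is only a sketch of the general technique, since [22] does not specify an algorithm; the area codes, the square polygons, and the function names below are made up for the example.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting test: count how many polygon edges a ray going right crosses.

    `polygon` is a list of (x, y) vertices; an odd crossing count means inside.
    Points exactly on an edge are not handled specially in this sketch.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_area(point, areas):
    """Return the code of the first area containing the point, or None."""
    for code, polygon in areas.items():
        if point_in_polygon(point[0], point[1], polygon):
            return code
    return None


# Two adjacent, made-up square areas standing in for zip code boundaries.
areas = {
    "90001": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "90002": [(10, 0), (20, 0), (20, 10), (10, 10)],
}
print(assign_area((5, 5), areas))
print(assign_area((15, 5), areas))
```

Because the area assignment inherits the positional error of the point, a geocode that falls on the wrong side of a boundary is assigned to the wrong area, which is one reason the choice of reference data matters.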
Even in the most recent research [35], the definition of geocoding is confused. The process of geocoding is mistakenly defined as ``a method in which certain characteristics of a person are estimated based on characteristics of the person's area or neighborhood''. This exemplifies the misunderstanding of geocoding. In that work, geocoding is actually used to generate a georeferenced point for the person, which can then be spatially intersected with datasets that contain other information, in this case census data for socioeconomic information.
The Master's thesis [86] provides a very detailed theoretical background on the geocoding process (mostly the address matching portion, with very little on the interpolation methods - the authors claim that [59] provides the best evaluation of interpolation), as well as a practical evaluation of geocoding with ESRI to determine the sources of geocoding error and to offer solutions to them. A review of the fundamental geocoding component algorithms, 1) probabilistic record linkage, 2) standardization (probabilistic and deterministic), and 3) Soundex, is presented, mainly relating the whole thesis to the geocoding available in ESRI. Reasons for the importance of geocoding in crime, health, commercial, and environmental applications with point data sources are listed. A detailed literature review is presented citing ([31] vs. [111]), the [15] accuracy discussion, and the three sources of error: 1) data input, 2) reference data errors, and 3) geocoding process errors (uncited in the thesis; the original source still needs to be found). Geocoding is defined, as in ESRI, as ``the process of assigning an x, y coordinate value to the description of a place by comparing the descriptive location-specific elements to those in the reference data''. A basic breakdown of the geocoding process is given: 1) parsing, 2) standardizing abbreviations, 3) assigning tokens to categories, 4) searching in the reference data, 5) scoring the candidates, 6) filtering, and 7) returning the best match. The matching problems from [93] are listed directly (without direct reference) with some slight additions - 1) standardization, 2) unrecognized components, 3) non valid street address, 4) ambiguity, 5) data entry, 6) vanity addresses, 7) mixed parity, 8) non-address input, 9) non-properly attributed reference data.
An excellent history and description of probabilistic record linkage is presented, whereby record linkage consists of 1) data, 2) standardization (probabilistic and deterministic), 3) searching/blocking, 4) selection of attributes, 5) weighting, 6) decision model, and 7) performance measures. Type I and Type II errors are used (this is from [91]). Most of the record linkage material is from [56], the deterministic standardization is from ESRI, and the probabilistic standardization is from the Australian HMM work ([21]); Soundex is described nicely. DIME and Nickel (ESRI - 2 segments for each road) are described in detail, and Nickel is said to be junk because it separates data between tables which need to be merged back into a DIME (flat file) format for efficient geocoding. Problems with reference data are listed as 1) many feature types, 2) many to one relations, and 3) sub addresses. The authors present an amazing timeline of the geocoding progress made at the Census Bureau, from [91]. A brief survey of people who studied accuracy is given: 1) [69] - 96% accuracy so everyone should use it, 2) [94] - minimum match rate, removed points until the patterns were no longer the same, and described sources of matching error: misspelling, wrong prefix, unknown abbreviations, wrong types, impossible addresses, no address at all, unit number confusion, extraneous text - but he misses (and the authors catch) that address matching errors are not random: one address on a street mismatched because of any of the above will mean all addresses on that street are mismatched - some sort of confounding, 3) [59] - evaluated 3 interpolation techniques (actually all the same, linear, just using different products), found they are all roughly the same, and argues for more sophisticated interpolation software to handle out of sequence addresses, better address matching, and reference databases with higher accuracy, and 4) [21] - Australia and the G-NAF use of HMMs for standardization.
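The weighting and decision model steps of probabilistic record linkage listed above can be sketched with agreement and disagreement weights in the style of the Fellegi-Sunter model. Naming that model is an assumption on our part, as the text above does not say which formulation the thesis or [91] uses, and the m and u probabilities, the field names, and the two thresholds are invented for illustration, not estimated from any data.

```python
import math

# Hypothetical per-field probabilities (not estimated from real data):
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are not a match)
M_U = {
    "street": (0.95, 0.01),
    "number": (0.90, 0.05),
    "zip": (0.98, 0.10),
}


def match_weight(a, b, probs=M_U):
    """Sum of log2 agreement/disagreement weights over the compared fields."""
    total = 0.0
    for field, (m, u) in probs.items():
        if a.get(field) == b.get(field):
            total += math.log2(m / u)              # agreement raises the weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


def classify(weight, lower=0.0, upper=8.0):
    """Decision model: two thresholds split pairs into three outcomes.

    Raising `upper` admits fewer false matches at the price of more missed
    ones, so the thresholds encode the risk one is willing to take.
    """
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


reference = {"street": "MAIN ST", "number": "150", "zip": "90001"}
same = {"street": "MAIN ST", "number": "150", "zip": "90001"}
typo = {"street": "MAIN ST", "number": "105", "zip": "90001"}
other = {"street": "OAK AVE", "number": "22", "zip": "90210"}
for record in (same, typo, other):
    print(classify(match_weight(record, reference)))
```

Pairs landing between the two thresholds are the ones that would be sent to manual review, which connects the weighting step to the performance measures at the end of the list.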
The authors go through the need for address standards, and address system problems: 1) no complete list within a jurisdiction, 2) different database formats, 3) non-addressable properties, 4) incorrect addresses, 5) inconsistent exception handling, 6) each government entity keeping its own list, and 7) the lack of high resolution data (3D, multi-residential, commercial structures). A claim that addressing is commonly attributed to Benjamin Franklin in the 19th century (Postmaster General) is made. An explanation of the datasets used is given: centerlines (parity errors, full - major and true - residential address ranges), parcels (low match rate, not all true addresses, address not always correct), land lists, and address point lists (output of geocoding, manually created from other datasets - big maintenance issue). In their practical section the authors analyze the existing match rate, propose a workflow, and make improvements. Three types of errors are identified: data entry, correctable (in the reference file - add an alias table, correct ranges, add more intersection match tokens), and geocoding base (the suite and commercial plaza cases are hacked around by making them look like an address). A nice table showing the breakdown of error types is given. The workflow is interesting in that it hints at dynamic geocoding (like asking the user for more information), and foresees the need to include geocoding methods as they are developed. To improve the match rate, the authors made manual corrections to the reference files, added an alias table, added and removed abbreviations from the ESRI files, removed unused attributes from the ESRI files, and created a bunch of utility ESRI scripts. The thesis concludes with improvements, which are mainly things that are actually limitations imposed by the ESRI software, except for the talk about dynamic geocoding and including linear referencing ([34], [27], [105] - geocode anything that can be identified).
Appendices which are a how-to for using the ESRI geocoder, including code, are given.
[72] introduces a framework for geospatial data extraction and integration based on georeferencing addresses extracted from web pages. The authors give a nice survey of the current geospatial data extraction, object consolidation, and data integration approaches, and explain how their wrapper and integration system work. The wrappers are exactly the same as those from [62], and the integration is based on other people's work but is still interesting. Their system does 1) crawl, 2) geoparse (identify location info), 3) extract (turn it into xml), 4) integrate (object consolidation), 5) geocode, 6) output to a GIS DB, 7) display on top of traditional GIS data sets. Nice detail is presented on their extraction tool (same as http://www.fetch.com) and on their data integration process, which defines ``semantic equivalence'', ``semantic similarity'', and a measure of the ``degree of semantic similarity'' using the vector space model, with weights from term frequency inverse document frequency (TFIDF). The geocoding process is defined as 1) parsing, 2) matching, and 3) locating, which all require two types of data: 1) address infrastructure DBs (address point DBs or centerlines) and 2) reference DBs to resolve ambiguities and revert to lower resolution (gazetteers, zips, borders). None of the details of any of these or their methods are presented, however, other than to say that tokenization was used to find the address pieces. A small set (540) of features is used, and good results (90% match rate) are achieved. A proposal to investigate the use of ontologies to recognize address pieces automatically in order to geocode better is presented.
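A degree of similarity under the vector space model with TFIDF weights can be sketched as the cosine between two weighted term vectors. This is a generic illustration and not the formulation from [72], whose details are not given above; the whitespace tokenization, the particular TF and IDF variants, and the sample strings are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TFIDF vector (term -> weight) per document.

    TF is the term count divided by document length and IDF is
    log(N / document frequency); both are one common variant among several.
    """
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (c / len(toks)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zero."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up descriptions of extracted objects that may refer to the same place.
docs = [
    "joe's pizza main street",
    "joes pizza main st",
    "city library oak avenue",
]
v = tfidf_vectors(docs)
print(cosine(v[0], v[1]), cosine(v[0], v[2]))
```

The first pair shares weighted terms and scores above zero, while the unrelated pair scores zero. Terms that differ only by spelling or abbreviation (``street'' vs. ``st'') contribute nothing, which is why standardization before comparison matters for object consolidation.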
[14] presents a detailed discussion of the format of the TIGER files.
[33] argues for the importance of addresses in GIS: they make geographic context ``attribute driven'', provide powerful query ability, and are human friendly geographic keys. The authors declare that small geographic resolution requires small geographic keys (addresses), and that addresses are 1) the key to legacy data and 2) the most common type of data used in government. A discussion about who can assign addresses (municipalities, counties - rural, USPS, and 911) is given, along with the point that the components of the format (number, name, direction, quadrant) together form the unique key which, once set, should not be changed. A differentiation is made between the jurisdiction district (people who have authority) and the addressing grid (which increments numbers from base lines); the grid should be frozen, and small jurisdictions should use base lines and grids from larger ones to maintain consistency. The address standard used in Orange County/Orlando FL is discussed, as well as how it took 13 years to standardize and assign addresses to all 260,000 parcels (including vacant ones), and why USPS and TIGER are not suitable because they try to handle every case.
[26] reports on the success of an address base creation project in Belo Horizonte, Brazil. Through a combination of low accuracy parcel maps (containing all parcel address information), high resolution imagery, and address lists (created from utility companies and taxing offices), the authors were able to accurately place 410,000 addresses into the correct parcels for the whole city. The authors used a manual approach that was aided by automatically zooming the display to each block and interactively asking for all the addresses on the block. A rate of 3.4 symbols (using three symbols - address, parcel, and block) per minute was achieved.
[34] provides a very short technical report about the variety of addressing systems used in LA County, and the problems that using multiple addressing systems creates. The authors provide succinct definitions that are useful: address - anything that can denote a location, addressing system - set of rules for assigning addresses in a network, house numbering system (HNS) - addressing system for buildings, house numbering scheme - using more than one HNS at a particular place (a city using multiple systems in different parts of the city). The authors present some interesting cases where the transitions between systems are bad. 3 types of addressing systems are classified: 1) Directional - with baselines (N/S separator), meridian (E/W separator) and origin (where baseline and meridian cross, or the central starting point), quadrants - main regions in a certain direction, azimuth - orientation of entire system. 2) Linear - monotonically increasing from a starting point, either mile-posting (in miles) or stationing (in feet). 3) Order numbering - houses are assigned numbers as they are built (similar to the irregular numbering found in places in Europe or Asia). Parity is discussed briefly, noting the special case of ``nipple streets'', or cul de sacs off larger streets that have a single parity, and directions in names of streets. The rest of the paper offers attempts to encode the HNS and suggests how it would be possible to keep track of all these systems for the purposes of adding or changing numbers when necessary, editing/correcting the TIGER files, and geocoding. Interesting exceptional cases are presented.
[76] describes the addressing system in Denmark, which is based on ``roof top addresses'' instead of linear interpolation and base line files. The authors describe the importance of the standardized addressing system as the key to all data throughout government, distinguishing between coordinate reference systems and identifier reference systems (including addresses, place names, and cadastres - anything that can be located without coordinates). A claim is made that addressing systems are the best for referencing because they are 1) well known, 2) of sufficient detail, 3) practical/logical, and 4) visible. The Danish law regulating addresses is described - each street has a unique code, each stairway has a unique number, and each apartment on a stairway has a unique number; in addition, it is law that each dwelling must be assigned a unique code, and each person living in a dwelling must be registered to a place at an address. Using codes reduces misspellings, and the laws relieve the need for a census. The data structure of addresses is discussed, questioning whether they should be attributes of other entities (which creates inconsistencies in storage between agencies) or entities in their own right; the authors choose the latter, with a centralized master ``address gazetteer'' under a single authoritative control and a differentiation between new addresses and assignments of old ones. The address contains spatial location, reference to named road, and house number. Rooftop addresses: good - positional accuracy, completeness, join capabilities, simple db structure, simple queries (no geocoder), bad - don't know the geometry of the road (no route planning), and missing addresses (in between ones) can't be interpolated. To generate the address database, the authors geocoded from base maps with address data and got 90% accuracy, with the remaining 10% interpolated from neighboring addresses. The system also gets automatic delivery of error addresses every month for correction.
[77] offers much of the same information as [76], but includes a discussion of the different types of representation that a geocoder can produce and in which cases each is needed: 1) polygons - represent a set of addresses N:1, where knowledge of the exact address is not crucial, 2) network based - know the network segment the address is on, requires the network to be created, allows for interpolation as well as routing and is N:1, and 3) point based - 1:1, and is needed when an address must be accurately identified in a larger GIS dataset (on a map or in an image).
[61] discusses the parcel numbering system in Korea, as well as the historical changes that it has undergone since its initial institution hundreds of years ago. A description is given of the various types of numbering systems that can be used, and the parcel is proclaimed to be the key to all geographic data in Korea through its 1) characteristics (unique), and 2) roles (used as key to relate to buildings, streets, cadastral data).
[74] presents the need for, challenges of, and work done to upgrade the entire addressing system of the state of West Virginia to the E911 standard, which means that every phone line in the state needs to have a city style address. This was a huge task involving departments up and down the state, county, and local levels because 66% of the state population lives in rural and unincorporated areas using rural route addressing. The approach was to first obtain high resolution orthoimagery, create GIS datasets from it (street centerlines, hydrography, building footprints, etc), and then to begin assigning addresses in rural areas, while ensuring that existing city style systems in the urban areas (which had previously taken up their own initiatives, using their own standards) fit in with the new addressing standards.
[108] looks at the necessity of using standardized addressing systems in the modern computer driven GIS era, and asks if the old rigid addressing systems are still required. The authors give a very nice off the cuff history of address management techniques, differentiating 1) reliance on lists (exact matches in text files), and 2) reliance on line segments (the theoretical address range versus the real address range, and how these create problems in geocoding, as the basic linear interpolation algorithm is described). A distinction is made between the requirements that a person has vs. those of a computer. A person needs to 1) find the street, 2) know east, west, north, south, and have 3) consistent orientation, 4) consistent polarity, and 5) sequential numbering - because they are using the system for navigation. A computer only needs 1) unique addresses, not sequential, 2) any label to match, not just an address, 3) topology for routing, consistent attributes not important, 4) the ability to store multiple labels for the same feature. The argument is made that computers will go back to the list reliance (address points, named places, parcels) for the most accurate representations, but could then revert to interpolation when addresses don't match or business needs dictate.
[105] presents a discussion paper by the UN Economic Commission for Africa (ECA) that talks about an addressing system for Africa. This paper provides a very thorough discussion of the need for such a system, the problems that hinder its development, and proposes recommendations for ways to overcome these hindrances. The authors state that a ``situs'' addressing system is more appropriate for Africa because it is a framework for dealing with multiple types of addresses (thoroughfare, parcel, landmarks) instead of the street address systems used in the West, it is a ``hybrid address model that accommodates names, aliases, and geographic reference'', meaning that it partially contains a gazetteer to link from popular to official and references via landmarks, or that a gazetteer is part of the situs addressing model. The problems encountered are very similar to those in Brazil with slums in particular (but the problems are everywhere) having ``chaotic land occupation'', unregistered land, no physical addresses, and buildings and parcels with no street access [27 ]. The ``duality'' of addressing is described as being caused by local people not consulted during the naming, boycotting, and confusing system choices- being referred to by both an ``unofficial'' (yet popular) name and also an ``official'' one, and that there is very bad signage and is sometimes in the local language. The authors point out all the ways that these hinder the development of the countries (economic, political, social), and call for a region scale addressing system to be used throughout the continent, and all the ways that it will benefit them (governance, service delivery, record production, safety, and helping citizens). 
This has to be accomplished with the help of the people through a bottom-up approach, using standardization (with flexibility) to 1) promote sharing of data, 2) reduce redundancy, 3) help with the identification of parcels, and 4) promote the census, along with the use of GIS tools; great detail is presented as to how these can help (temporal tracking, integration of data, visualization). A proposal is made to include fire hydrants and traffic lights (presumably because, unlike in the West, where these are kept in separate GIS datasets, no other infrastructure exists). The requirements of a good system are defined as providing addresses that are 1) identifiable, 2) accurate, 3) accessible, 4) simple, 5) unique, 6) consistent, 7) transferable, 8) logical, 9) flexible, and 10) cost effective. A solution to the slum problem is presented: take into account any building or parcel that can be identified and give it a unique address.
[23] describes the use of lexicon-based tokenization and probabilistic hidden Markov models (HMM) for address and name standardization with Australian data. The authors present the basic idea of record linkage (mostly citing other survey work), and differentiate between deterministic and probabilistic linkage. The standardization task is broken into 1) segmentation, 2) transformation of components into canonical forms, (optionally) 3) imputation of missing components, and (optionally) 4) enhancement with known alternatives. The cleaning and tokenization algorithm is presented: 1) convert to lowercase, 2) replace substrings with their canonical forms, 3) normalize punctuation, 4) split into vectors based on white space, and 5) assign each token a category based on lookup tables (to produce hints). An excellent introduction to HMMs is given, describing the training process, which uses several training sets and passes through subsets of the data, feeding the incorrect matches back into the system. The authors trained on only 1450 records and tested on a completely different data set from the one used for training. The authors found that the HMMs worked very well for standardizing addresses, but not very well for standardizing names.
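The five cleaning and tokenization steps listed above can be sketched as below. This is only an illustration under stated assumptions: the CANONICAL and TAGS lookup tables and the tag names are hypothetical and far smaller than any real lexicon, and the HMM that would consume the tags is not shown.

```python
import re

# Hypothetical lookup tables; a real lexicon would be much larger.
CANONICAL = {"street": "st", "road": "rd", "avenue": "ave"}
TAGS = {"st": "WAYFARE_TYPE", "rd": "WAYFARE_TYPE",
        "ave": "WAYFARE_TYPE", "unit": "UNIT_TYPE"}


def tokenize(address: str) -> list[tuple[str, str]]:
    """Clean an address string and tag each token with a category hint."""
    text = address.lower()                            # 1) lowercase
    for word, canonical in CANONICAL.items():         # 2) canonical substrings
        text = re.sub(rf"\b{re.escape(word)}\b", canonical, text)
    text = re.sub(r"[^\w\s]", " ", text)              # 3) normalize punctuation
    tokens = text.split()                             # 4) split on white space
    tagged = []
    for token in tokens:                              # 5) tag via lookup tables
        if token in TAGS:
            tag = TAGS[token]
        elif token.isdigit():
            tag = "NUMBER"
        else:
            tag = "UNKNOWN"
        tagged.append((token, tag))
    return tagged


result = tokenize("42 Main Street, Sydney")
```

The resulting tag sequence is the kind of observation sequence an HMM could then segment into address components.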
[21] explains the development of a probabilistic geocoding system based on HMMs and inverted indexes using the Geocoded National Address File (G-NAF) in Australia and a hierarchical rule base. This approach, based on a file, does not interpolate (as far as listed), but matches at successively coarser levels of resolution. Their approach uses machine learning to develop a parser based on Febrl (from [23]), and creates inverted indexes for the target dataset to match against (the reference file). A hierarchical matching algorithm is used that allows for the inclusion of other data sources of less and less accuracy, and the system tries each one in order of decreasing accuracy for a match. Geocoding is defined as a ``special case of [record] linkage'', and a discussion of its importance is given along with a differentiation between batch geocoding, where addresses are matched to the best possible match and a match status is returned (quoting [92], from 2004, that the G-NAF literature claims a 70% match rate is generally acceptable), and single geocoding, where a list of alternatives is returned. The G-NAF model (32 million addresses of varying resolutions - address, alias -> localities) is presented. The system presented in the work 1) preprocesses, cleaning the addresses and the reference file (G-NAF) using [23], presented with a nice description of HMMs and of using the Viterbi algorithm to pick the best path through the HMM, and allowing the system to include other sources to clean and fill in missing information (PO information, neighborhood regions), 2) creates inverted indexes of the reference file, and 3) passes the input address and the inverted index into the geocoder. The geocoding engine uses a hierarchical matching approach that starts at high resolution and moves outward from the street through the locality to neighboring localities and then to non-neighboring localities. If multiple results are returned, the system can either rank them (by likelihood) or return the weighted average. The system achieved a 94% match rate for any match type and a 73% exact match rate.
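The combination of inverted indexes and hierarchical fallback described above can be illustrated with a small sketch. It is not the system of [21]: the REFERENCE records, field names, and LEVELS are hypothetical, the fallback only widens from address to street to locality (no neighboring localities), and multiple candidates are combined with a plain unweighted average rather than a weighted one.

```python
from collections import defaultdict

# Hypothetical reference file; G-NAF itself has a far richer schema.
REFERENCE = [
    {"id": 1, "number": "12", "street": "main st", "locality": "ryde", "xy": (1.0, 1.0)},
    {"id": 2, "number": "14", "street": "main st", "locality": "ryde", "xy": (1.0, 3.0)},
    {"id": 3, "number": "7", "street": "high st", "locality": "ryde", "xy": (5.0, 5.0)},
]

# Match levels tried in order of decreasing resolution (an assumption).
LEVELS = [
    ("address", ["number", "street", "locality"]),
    ("street", ["street", "locality"]),
    ("locality", ["locality"]),
]


def build_index(records, fields):
    """Build one inverted index (value -> set of record ids) per field."""
    index = {field: defaultdict(set) for field in fields}
    by_id = {}
    for record in records:
        by_id[record["id"]] = record
        for field in fields:
            index[field][record[field]].add(record["id"])
    return index, by_id


def geocode(query, index, by_id):
    """Return (match level, coordinates) at the finest level that matches."""
    for level, fields in LEVELS:
        candidate_sets = [index[f].get(query.get(f), set()) for f in fields]
        ids = set.intersection(*candidate_sets)
        if ids:
            points = [by_id[i]["xy"] for i in ids]
            x = sum(p[0] for p in points) / len(points)
            y = sum(p[1] for p in points) / len(points)
            return level, (x, y)
    return "no match", None


index, by_id = build_index(REFERENCE, ["number", "street", "locality"])
exact = geocode({"number": "7", "street": "high st", "locality": "ryde"}, index, by_id)
fallback = geocode({"number": "13", "street": "main st", "locality": "ryde"}, index, by_id)
```

Here the second query has no matching house number, so the sketch falls back to the street level and returns the mean of the two known points on that street, together with the match level so that a caller can judge the resolution of the result.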
[20] provides an excellent summary of the freely extensible biomedical record linkage (Febrl) project. The importance of record linkage in health is presented, listing geocoding as a special case. An excellent summary is presented of 1) record linkage techniques - deterministic/probabilistic, probabilistic - classical/machine learning, 2) data cleaning/standardization, summing up the HMM work from their previous papers [23,21], 3) blocking - standard/sorted neighbor/bigrams, 4) record pair comparison - vector of weights, 5) geocoding - error, method - preprocessing/matching, theirs - inverted index/rule based (exact match, approximate match, increase neighborhood, loop), 6) the need for parallelization and their implementation with Message Passing, and 7) the need for data generation because of privacy constraints.
[19] presents the building blocks for a bus routing system which includes a geocoder as part of a GIS, but does not give any details on how the actual geocoding will be done, just stating that it is important.
[52] includes the geocoder as a crucial component in their trigger-based pervasive computing framework.
[79] presents the findings of surveys performed on all 54 jurisdictions (50 states plus DC, Puerto Rico, NYC, and the US Virgin Islands) by the National Center for Health Statistics (NCHS). The NCHS and the National Association for Public Health Statistics and Information Systems (NAPHSIS) found that 21 (of the 41 who responded) were geocoding, 91% had improved data quality, 95% found geocoded data useful for research, and 93% of those not geocoding wanted to. The respondents were using software from 1) in-house development, 2) GDT, 3) ESRI, and 4) Finalist/Final Focus. Some were outsourcing to private firms and universities. The major concerns were 1) speed, and 2) confidentiality.
[1] presents an excellent review of the need for confidentiality in data, how geographic masks can be used to achieve it, existing ways that it is presently accomplished, and several new methods for doing it (affine transformations, random perturbations, aggregation, neighboring, and context). The authors claim that the best approach depends on the purpose in mind as well as the degree of disclosure risk that is acceptable, and provide an excellent evaluation of the different approaches to provide guidance as to which a researcher should choose.
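As an illustration of one of the masks named above, a random perturbation can be sketched as displacing each point by a random direction and a random distance up to some maximum. This is a minimal sketch under assumptions, not the procedure evaluated in [1]: the function name, the planar coordinates, and the choice of a uniform distribution over a disc are all illustrative.

```python
import math
import random


def random_perturbation(x, y, max_distance, rng=None):
    """Move a point to a uniformly random location within max_distance."""
    if rng is None:
        rng = random.Random()
    angle = rng.uniform(0.0, 2.0 * math.pi)
    # The square root makes the displaced points uniform over the disc
    # rather than clustered near the original location.
    distance = max_distance * math.sqrt(rng.random())
    return (x + distance * math.cos(angle), y + distance * math.sin(angle))


rng = random.Random(0)  # seeded only to make the example repeatable
masked = [random_perturbation(100.0, 200.0, 50.0, rng) for _ in range(1000)]
```

A larger max_distance lowers the disclosure risk but also degrades any analysis that depends on fine distances, which is the trade-off the authors ask researchers to weigh.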
[51] illustrates the use of GIS in examining the equity of access to dental services in the UK. Geocoding is performed to the postcode level for the 2976 of 5630 addresses that can be geocoded, achieving a 93% match (with the remaining 7% unmatched due to incompleteness or errors). The remaining 2700 addresses that were missing were shown not to introduce bias based on previous studies, and were not in any pattern (spatial or socioeconomic). The authors correctly recognize that their results will be subject to some bias because incorrect geocoding results in misclassification of SES.
[12] discusses methods for developing a GIS to calculate long-term pesticide exposures from historical data on crop reports and pesticide spraying in order to determine regional differences in the links between pesticide use and disease rates. The authors develop detailed models to account for the transport of the pesticides and determine ``exposed'' from ``unexposed'' based on proximity (400 meters for aerial, 40 m for ground). The transport model presented takes into account AgDrift, ISCST3, wind rose, and temperature inversion heights to develop a power law best fit derived algorithm for residences near aerial applications. Geocoding is performed to the parcel centroids, but nothing is stated about the accuracy of the geocoding. With such sensitive models, even a slight geocoding error can have a large effect (as when others adopt these methods without parcel data and need to interpolate).
[25] discusses the DHHS view of the critical role that geospatial data plays in the national spatial data infrastructure (NSDI), and in particular how important it is to the public health community. Needs for emergency management and surveillance are discussed. Geocoding is listed as a critical component which needs a ``best practices document'' and standardization, and as a national health goal ([106]), and local health departments are listed as not being as good at it as state and national ones. The necessity of confidentiality in public health data is discussed. Interoperability is determined to be a key component of realizing a true NSDI for health geographic data, as is conflation (with a good example of NYC 911 trouble). The ACOE TEC is said to be working on interoperability as a primary goal.
[22], while reviewing the use of geocoding in recent health studies in their survey of GIS usage in health, state that geocoding is one of the main methods for data gathering, and as such has been placed as a national goal ([79]) and a Healthy People objective ([106]), and that the availability of high precision (address point) geocoding would fuel the demand for spatial analysis tools. The authors also point out that health data is necessarily low resolution out of privacy concerns. This same sentiment about the prevalence of using geocoded data as the basis for analysis is echoed in the example studies listed in the review of GIS usage in epidemiology offered by [87].
In their excellent review of the characteristics, uses, and sources of spatial data in cancer studies, [8] present a detailed review of the geospatial data sources and types that are available, aimed at informing the cancer research community on how to begin using geospatial data. The authors discuss the characteristics of spatial data (created by geocoding, and necessarily aggregated - which can introduce bias). Geocoding is defined as ``individuals assigned coordinates corresponding to their residence'', and the authors provide a set of references that study the accuracy of geocoding. MAUP is discussed, in particular how integrating data from different resolutions causes problems, and how spatial autocorrelation and non-stationarity can introduce bias. Many different data sources are discussed, highlighting how each has been used in particular studies. A nice list of studies that have used geocoded data as their basis is presented.
[9] presents a very detailed review of the use of GIS in the health fields. The authors describe 1) example applications, 2) problems hindering widespread adoption of GIS, and 3) requirements for widespread usage. The review is nicely related to previous ones. The different ways that GIS are used (spatial statistical modeling, hybridding socioeconomic data, choropleth maps, smoothed maps, clustering, spatial data mining, remote sensing, and location based services) are given. A large number of health application examples are presented, as well as several reasons why GIS is underutilized (academics not closely working with community, no georeferencing of patient records, necessary aggregation for privacy, lack of coordination and high level policy goals). Finally, requirements set forth in a number of studies are merged into a nice common representation. The authors speak to the need for standards and emphasize the importance of the OGC and GML. Geocoding is listed as one of the most essential spatial infrastructure-building tasks (quoting [25]), and reference is made to the NAACCR section of the GIS Handbook [114] that discusses the importance of geocoding. Conflation (apparently NYC is doing a big project [25]) and integration are listed as top requirements as well. A section is given about the problems of using aggregate data, and its introduction of ``spatial uncertainty'' (from [1]), as well as statistical methods to mask data and preserve geographic certainty. A short section on data errors paraphrases [90]. A really good history of spatial data infrastructures, with examples, is given, as are the requirements for a real time GIS health/environmental surveillance system with example applications. Requirements and data problems are represented with nice breakdowns in tables.
[37] looks at residential location as a potential determinant of exposure to chemicals from a superfund site. The authors use 3 geocoding reference files to maximize reliability (ESRI StreetMap, TIGER, and Delorme). The authors are cautious about their geocoding results, citing the most prominent works ([15], [6], [8], [83]), but the effects of geocoding are presented in their other paper ([48]). Interestingly, the authors were unable to geocode PO Boxes and rural routes, but did find that the distribution of these non-matches had no pattern with regard to their covariates, so their exclusion did not introduce any bias (this could be because the whole town is white middle class).
[48] presents a related study that serves as an introduction to using GIS and geostatistics. A very good introduction to geostatistics is presented as the authors outline the steps that were taken to characterize a superfund site. The importance of georeferencing and geocoding is stressed, paraphrasing the ESRI definition ``linking any piece of information or attribute to a geographical location which already exists in a GIS''. Not much detail about any error introduced during geocoding is given, even though [37] claims the details of geocoding are in this work. [37] were perhaps referring to the details of georeferencing the superfund site, for which great detail is provided.
[63] presents an early examination of the validity of using census level socioeconomic data to supplement individual level records when the individual level variables are not available. The authors geocode 1,924,995 people to the census tract, block, and county at a cost of only $4.50 per 1000 in 1985, achieving a success rate of 82%. The results show that using census level data as individual data is valid for determining the associations of socioeconomic characteristics with measures that are known to vary by race and SES (smoking, number of pregnancies, hypertension, and height). The authors do note that their data may be biased because of 1) non-geocodable addresses, 2) census undercount (mainly in poorer areas), 3) the temporal validity of the data, and 4) ecological fallacy.
[42] presents a statistical framework that assesses the validity of using aggregate data as a proxy for individual information. There is great statistical detail on how to model the effects, and a description of 2 sources of bias is given: 1) errors-in-variables - the aggregate variable is only imperfectly correlated with the micro-variable it represents, and 2) aggregation bias - the aggregate variable may be correlated with the micro-level equation. The work uses Panel Study of Income Dynamics (PSID) data (geocoded to zip code and tract to match census variables, with 95% and 72% match rates) and National Maternal and Infant Health Survey (NMIHS) data (zip code -> census data, with an 87% match rate). The findings show that there was less variability in the census data (because it consists of averages). In general the authors caution against the use of aggregate data as a proxy for individual data and find two trends: 1) when there are variations in relevant independent variables within aggregate units, it will underestimate micro variables, and 2) when the aggregate variables represent broader constructs than the micro-level constructs, it will exaggerate the effect of micro-level variables. Criticism is made of [63], stating that its finding of consistent results between micro and aggregate data is the exception and not the rule.
[39] sparked off a heated discussion as it argues that 1) the age of the census data used in health equations is not a huge factor and 2) the choice of smaller units over larger ones does not have a large impact. These assertions are supported statistically: their regression shows no effects from 1) because the census data is stable over a 10 year period, and none from 2) because the socioeconomic variations are quite similar between the larger (zip code) and smaller (census tract) areal units. Additionally, the authors argue that there is no advantage to including multiple aggregate measures over a well chosen single one, and that the conceptual ``differences among aggregate variables are more blurred than those between their micro level counterparts'', and refer to their previous statistical framework which defines two sources of bias ([42]).
[80] presents a framework to automatically remodel the census geographies based on the needs of the user. It presents a very nice breakdown of the history into 4 stages: 1) pre-1960's (everything manual), 2) 1960's (digital statistics), 3) 1980's (digital geographies and the availability of techniques to remodel the geographic data after publication), and 4) 2000's (digital statistics, geographic encoding, and geographies - allowing for custom geographies to be developed even before the census is released). The authors give a nice history of the work that people have been doing to remodel census geographies (areal interpolation and surface construction), and describe a prototype for the ``automatic creation of output areas, independent of collection geography used'', which is made possible by the existence of the ADDRESS-POINT DB containing the geocoded coordinates of every postcode in the UK (the building blocks used to create polygons from which any geography may be created, satisfying any objective function).
[67] responds to [39], arguing that 1) the sample structure and low geocoding match rates (68% zips, 72% census tracts, compared to reported 80% zips and 90% blocks in other studies) may have introduced bias and inflated the confidence of the sample, 2) the authors did not perform reliability analysis on the health related census variables, instead using the chi-square, 3) they ignore the studies comparing effect estimates based on individual and area based socioeconomic indicators, and 4) they incorrectly assert that census block data is unavailable and that it ``systematically excludes rural residents''.
[40] responds to the criticism of their article in [67]. The authors restate their specific research question ``How well individual level associations can be inferred when aggregate socioeconomic variables proxy for individual characteristics in health equations?'', arguing that the problems presented in [67] are not problems: 1) data was not representative - the other data sets used successfully geocoded 95%, 2) statistical methods were flawed - chi squared is appropriate, 3) results are at odds with most research - they did cite other research, and there are only 3 papers addressing the same question, and 4) block group data are available - not with confidentiality concerns and the lack of geocoded data to that level.
[101] also responds to [39]. The authors present the details from their own study that show that smaller areas do improve the predictability of health, also arguing that aggregate area based measures are not reliable substitutes for individual based measures, and that unit changes at one level of aggregation do not correspond to unit changes at the individual level.
[41] responds to the [101] letter, in particular addressing the question of whether a unit change in an aggregate variable will have a larger effect on health outcomes than a unit change in a comparable individual level variable. The authors argue that aggregate data can be used as a proxy for individual data in their particular context because of the relationships between the dependence of the variables used.
[103] presents a spatial analysis of the bovine spongiform encephalopathy epidemic in Great Britain. For confidentiality, the study geocodes the farms down to the parish level and uses the centroid, assuming that the parishes are circular and that the holding areas for cattle are in the center. When a farm crossed multiple conjoined parishes, the centroid of the largest was used, and when the parishes were disjoint the centroid of them all was used. The authors argue that the aggregation does not affect the results because the level of aggregation is still small compared to the study area, the whole country.
[102] presents a follow up to the hotly contested [39] study. The authors attempt to perform the same study, except with the problems that others pointed out addressed: 1) non-representative data sets - used a huge set of self reported health data and the 1990 census, with full coverage of all the US, 2) poor geocoding performance - 90% match rate to block level, and 3) too small a scale of study - 183,706 people. Two analyses were performed: 1) ordinary least squares - size and direction of bias, control for SES confounders between health and race, demonstrate exploratory strength of different measures, and 2) replication of [39] - errors in variable bias, and magnitude and direction of aggregation bias. The findings show that even though their data is much larger in scale and uses block level data, only slight improvement was achieved with smaller units, the same as [39]. However, the authors did differ in finding that aggregate proxies did sufficiently control for SES confounding in the interpretation of race and health.
[66] attempts to address the question of whether the choice of areal unit and the size of the geographic unit that people are geocoded to have an effect on the outcome of health studies. The authors geocode to the census block, tract, and zip code level using a commercial firm, whose accuracy has been ``validated'' [69], achieving 93% to the block, 99% to the tract, and 94% to the zip code. The study finds that the choice of measure and the level of geography chosen do make a difference in the outcome of results.
[70] attempts to address the problems of 1) choosing which area-based socioeconomic measures to use, and 2) choosing which level of geography. The authors were able to geocode 80% and 99% to the block and zip code, and geocoding success was independent of any socioeconomic patterns. The study determined the SES gradients per disease and non-fatal weapons injuries, and found that both the choice of measure and the choice of geographic scale matter.
[13] reports on the efforts to build a reliable and accurate street centerline geodatabase for the Metropolitan Government of Nashville and Davidson County, Tennessee. The authors found that accurate geocoding was best achieved by enforcing correct data entry through forms that require the street address components to be entered individually, checking the consistency of each as it is entered.
[64] offers a very brief commentary on the increasingly important role of geocoding in epidemiology. The author states that completeness does not equal success (meaning that a high match rate is not necessarily very accurate), cites [53] as showing that geocoding PO boxes to the zip code centroid will introduce significant misclassification into the results, and comments on the expense of the attempts in [83] to geocode every single address using a multistage approach. Additionally, the author states that every study needs to determine its accuracy requirements, noting that some cancer studies need very accurate results (like the pesticide and pollution studies), but that [6] has shown that geocoding is mostly fairly accurate.
[18] presents a study on the correctness of assigning people's ethnicity/race from the census block group values that are derived from point in polygon operations that determine the appropriate census block via geocoding the address. The authors have nice references on the difficulties of geocoding, the validity of using the census as a proxy for SES, and (the very few) studies that evaluate the accuracy of SES data from geocoded census tracts. This study was done on a huge population (3.9 million addresses) by Kaiser Permanente, and compared the self reported race from 1) insurance records and 2) birth certificates from their hospitals. The study only used Blacks and Asians because of problems (ambiguities) with the census questions for white and Hispanic. People were ``assigned'' to a race if their block was over 50% one ethnicity, assuming that the ethnic area of a dependent was the same as that of the primary (if they did not live at the same place) - perhaps an assumption with serious ramifications. The authors were able to geocode 93% of addresses to the block, but do not go into detail about the geocoding method. This high result is probably due to relaxed conditions - any address on the street (even a false one) goes into the same block group anyway. The study shows that the ethnicity produced from the census data is very wrong.
[35] uses geocoded results to obtain socioeconomic attributes about people to determine managed care disparities.
[65] demonstrates the feasibility of augmenting health surveillance systems with socioeconomic (SE) characteristics of the area where the subjects reside. This work relies heavily on the ``96% accuracy'' of their commercial firm ([69]). The census tract is used as the areal unit (found to be ideal based on their previous papers) because it 1) consistently detected SE gradients, 2) had the maximum geocoding success rate and match rate to SE data (versus the block and zip code units), and 3) is readily interpretable - these three are similar, but not the same as Gregorio2005. The authors claim the ability to use the SE status of ``poverty'' from the census tract to surveil SE inequalities of health, stating that the results are not biased because of 1) geocoding error (from their commercial firm), 2) choice of areal unit (because similar results are achieved using different sizes), and 3) because the census tracts are designed to be ``relatively homogenous'' and are used by the government to distribute low income money and services. Instead, the SE factors are 1) composition (people in poor areas have poor health because poor people have poor health), 2) context (a concentration of poverty exacerbates harmful social interactions), and 3) location of public goods and environmental pollution. The authors claim that the SE measures are not proxies for individual data, and that SE area based measures apply to all persons in the area (directly in opposition to [18]), while conceding that their estimates are subject to concerns because of race misclassification and census undercount.
[44] uses different areal units to determine if they make a difference in cancer incidences. The authors find that using increased resolution does not increase the accuracy, but mention that the discrepancies could be due to differing abilities to geocode across different locales (cartographic confounding). Geocoding requirements are defined as 1) accurate - within an acceptable distance, 2) precise - at a desired level of areal unit, and 3) ``fit for use'' - applicable to data ([18]). This is recognized as being part of the MAUP, and previous attempts to determine if different choices matter are listed - [66], [100], concluding that the artifacts associated with choosing differing areal unit sizes cannot be reliably predicted (potential research topic). A good discussion of the tradeoffs in different levels of geocoding is presented: 1) time and training, 2) protecting confidentiality, and 3) interpretability of results.
[97] presents an overview of geocoding specifically geared toward its usage in cancer research. Two topics are explored: 1) how geocoding is used in cancer research, and 2) methods to improve the accuracy. The authors begin with a discussion of why high accuracy is needed because small distances matter in fine-grained models, and how geocoded locations are used to link to other types of data. Three methods of geocoding are described: 1) assigning an observation to a geographic unit (without specifying where in the unit), 2) interpolation, and 3) parcel matching. TIGER is briefly presented, as well as the interpolation method (match segment, interpolate). Sources of error are given: 1) hit rate, 2) inaccurate geometry in ref file, 3) missing data in ref file, 4) interpolation - address ranges are incorrect and too large so the points clump up at the low end, 5) errors in the input data, 6) using Soundex, and 7) weighting component pieces and how there is a tradeoff between match rate and accuracy. The authors claim that using point in polygon for determining census codes is really bad because it creates serious errors when there are little errors in the geocoded points, arguing for the usage of a lookup table instead that maps address to segment, and then gets the census ID from the segment. A good discussion of problems using zip codes as geocodes is presented, namely that zip codes have not been used by the census since 1990, they change over time, and the ZCTAs (aggregations of census blocks) are misaligned with zips. ZCTAs are stated as only being good for use when geocoding to the census block level, and an example of misuse in the Medicare field is given - data created using the zips and displayed and analyzed using the ZCTAs. Linking cancer data to environmental data is discussed, along with why it's hard because environmental data is usually raster and cancer data is based on vector. 
Problems in determining distances are covered: 1) not using the road network and 2) centroids of areas are not the best location to use (spatial aggregation error). A great discussion of the issues revolving around confidentiality and privacy is presented: 1) privacy is an issue when geocoding health data, 2) points on maps suffer from inverse geocoding, and 3) privacy and public good have to be balanced (in contrast to [90], who says getting rid of privacy will be the best possible thing for healthcare). An overview is given of newly emerging models that can account for the spatial accuracy of the geocoded points and any spatial masking that has been performed, as well as an introduction to the ways to spatially mask data and still maintain its usefulness, stating how different methods are suitable in different circumstances. The NAACCR recommendations for geocoding are restated, focused mainly on correcting the data at ingest, and the authors argue for the extension of census tract certainty measures to finer grain levels such as census blocks.
[38] is one of the first to investigate the possible effects of geocoded points being placed into incorrect enumeration districts (ED), and was one of the first to caution that this could have serious impacts on research performed using the aggregate data associated with the incorrectly assigned ED. It presents a very nice history and description of address data types in the UK (postcodes, CPD, and Pinpoint Address Code (PAC)), the systems that all other UK geocoding papers refer to. The study looks at 1) the accuracy of the Central Postcode Directory (addresses -> postcodes) and the PAC, and 2) the association of an ED with a postcode by assignment (to the spatially nearest ED centroid - not the physical centroid, but one skewed to the population center) and by allocation (point in polygon), of which the authors feel the 2nd is more important. The study focuses on a rural area, and there is a note that ``gazetteers'' were used to create the PAC, which contained premises, streets, districts, towns, and postcodes - a really early reference showing that gazetteers can contain addresses. A mean PAC point is computed from the PAC addresses in the postcode to compare with the single CPD point for a whole postcode. The study finds that 97% of the CPD points are within 200m, but that the misclassification is a big problem. The two methods of associating an ED (assignment versus allocation) are compared to the optimal PAC point in polygon, because the authors note that most researchers don't have the money to buy the PAC and want to know just how badly the CPD assignments compare. Through some statistical analysis, the results show that the CPD data can be used fairly well if point-in-polygon matched to digital boundaries, and the authors recognize that the digitization of the boundaries will introduce uncertainty for addresses near the boundaries.
The question of generating the centroid of the PAC points is addressed: 1) the number of points varies between postcodes, 2) the centroid is subject to outliers - with the authors offering some measures of central tendency (spatial mean, Weberian location) to adjust for this, and 3) the PAC points are not evenly distributed throughout a postcode. The authors outline how these dispersions of addresses can be modeled to give some insight into how well the centroid represents the PAC.
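To make the outlier point concrete, here is a minimal sketch (not the method of [38]) that compares the spatial mean of a set of address points with the Weberian location (the geometric median), computed with Weiszfeld's iteration. The function names and the sample coordinates are illustrative assumptions.

```python
import math

def mean_center(points):
    """Spatial mean: the average of the x and y coordinates."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def weber_point(points, tol=1e-9, max_iter=1000):
    """Weberian location (geometric median) via Weiszfeld's iteration.

    Minimizes the sum of Euclidean distances to all points, which makes it
    far less sensitive to outliers than the spatial mean.
    """
    x, y = mean_center(points)  # start from the spatial mean
    for _ in range(max_iter):
        num_x = num_y = denom = 0.0
        for px, py in points:
            d = math.hypot(px - x, py - y)
            if d < tol:
                # Estimate sits on a data point: skip it to avoid dividing by zero.
                continue
            w = 1.0 / d
            num_x += w * px
            num_y += w * py
            denom += w
        if denom == 0.0:
            break
        nx, ny = num_x / denom, num_y / denom
        moved = math.hypot(nx - x, ny - y)
        x, y = nx, ny
        if moved < tol:
            break
    return (x, y)

# Hypothetical postcode: four clustered addresses and one distant outlier.
addresses = [(0, 0), (10, 0), (0, 10), (10, 10), (1000, 1000)]
print(mean_center(addresses))   # dragged far toward the outlier
print(weber_point(addresses))   # stays inside the cluster
```

In this made-up example the spatial mean lands roughly 200 units from every clustered address, while the Weberian location remains within the cluster, which is the kind of adjustment for outliers the entry describes.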
[5] presents the problems that the Chicago Police Department had when trying to geocode crime incidents for spatial statistical analysis. The main problems were 1) missing segments and erroneous inclusions (it was found that 50% of street segments were actually wrong), with the 120 foot accuracy of TIGER making streets on the boundaries particularly problematic, 2) the existence of unnamed streets (which were more prevalent in certain areas - this is related to ``cartographic confounding'' [88], and should be noted) and of new and re-aligned streets post census (which conflicts with the claim in [45] that this is not a problem), 3) multiply named streets, 4) missing or incorrect address ranges for blocks, 5) the non-existence of address ranges for highways, and 6) user input error from beat officers - spelling, abbreviations, and street aliases. The authors solved most of these errors by correcting the TIGER files and creating a database of mappings between common errors and the correct versions in the TIGER files, without doing anything probabilistic.
[81] provides a very good history of geocoding in the UK and how the datasets have become increasingly detailed, starting with the existence of the central postcode directories (grids accurate to 100 m) and postcode address files, moving to the Pinpoint Address Code, and culminating with the development of the Ordnance Survey's ADDRESS-POINT Reference (OSAPR), with further details available in [49]. This was generated in a similar fashion to points geocoded via TIGER. It used the Ordnance Survey's Land-Line road networks, and assigns ``property seed points'' with a 0.1 m resolution. Additional history is presented regarding the adoption of the British Standard BS7666 parts 2 and 3, which specify a national address standard (part 3) and define a national land and property gazetteer (part 2). The authors further define ``hybrid georeferencing'' as combining datasets of differing levels of resolution/accuracy to obtain socioeconomic data, and describe how levels can be generalized or refined to derive more or less detail. The main analysis the authors perform covers 1) the difficulties of matching addresses to the other available data sets (census, postal, and property registers, the last being described in greater detail in [50]), 2) comparing the accuracy of the existing and derived data sets, and 3) comparing the census information associated with the existing and derived sets. The study determines that the adoption of address standardization will be key in the address matching problem because of format problems (missing addresses, wrong postcode, wrong number/name) and programming problems (composite addresses such as 213/215, and incompatible formats) which limit text string matching, so an attempt was made to use a combination and match down to the postcode (about 15 houses), and further when possible.
In a discussion about the accuracy between datasets, it is found that the derived data is actually more accurate than existing data in some circumstances, but the level of commercialization makes a difference. The determination is made that the census data underestimates in recently developed and non-domestic properties, while ADDRESS-POINT underestimates in dense areas where the number of households exceeds the level of property subdivisions captured by AP, usually poorer areas (possibly cartographic confounding [88]). Errors are uncovered when multiple OSAPR points are assigned to the same unique property reference number (UPRN) because a property has been subdivided but is listed only once in the tax register (described as a programming error), and when multiple OSAPR points are assigned to the same property (indicating erroneous postcode assignments from the tax register).
[4] attempts to measure the accuracy of self reported retrospective data, and finds that some variables have high accuracy (residences, occupations) while some are low (food eaten, health status - weight). A good review of the current literature regarding the accuracy of retrospective data is presented, stating that most are about 80% accurate in the fields of locations and health factors, but other than tobacco, ingested substances are not as accurate. The authors develop the ``lifegrid interview process'' and find that it works well by creating timelines onto which people can put personal contexts (external, family, residential, and occupational), and possible sources of bias are removed (no leading questions, participants were not told that the focus was on childhood, and recall and archive data were compiled separately and not mixed). The authors admit that the ``gold standard'' archive data may have been wrong when it was recorded.
[75] discusses the difficulties of geocoding intersections from the reported crash sites in Hawaii. It presents a textbook definition of geocoding from 1998: ``Assigning geographic coordinates to features not directly referenced but that have attributes that can be converted into geographic coordinates''. The basic process of geocoding is presented (match the input address to a geographic reference file that has matching attributes, then interpolate along the centerline). The authors claim that ``intelligence'' is missing to analyze geographical context, or to use a street network to geocode intersections. Crashes are reported as intersections, milepost markers, or the direction and distance from an intersection, and all crashes are assigned to the nearest intersection. An excellent brief discussion of the basic geocoding process is presented: 1) matching: select the segments with the correct name, then select the one with the right address, and 2) interpolation. Intersection geocoding is explained as either non-topological (a database join on the names) or topological (connections in the network), and possible sources of error are given: 1) the write-up by police, 2) data entry (abbreviations), 3) assignment to the nearest intersection, and 4) error as a function of segment length. Types of errors are defined as 1) geodetic - problems with the reference db (which is assumed to be spatially correct), 2) coding - matching the wrong segment, 3) resolvable - those that can be fixed, and 4) unrecoverable - those that cannot. A good discussion is given of the relaxation and approximation process of matching. The authors recommend doing it in steps (change a single attribute) and passes (change multiple attributes) to reduce the error, and provide the order in which to do the steps, but recommend never relaxing on the name - showing that doing so increases the match rate by 8% but increases the error by 48%.
Problems with the reference files are listed: 1) intersections not intersecting (onramps), 2) missing named places, 3) multiple intersections, 4) slang, and 5) coded terms. For accuracy, the authors only check whether the match is to the correct segment or not, and do not evaluate ``geodetic error'' (spatial accuracy). The process 1) standardizes the names using a rule based approach, and 2) creates layers for pseudo-intersections (onramps) and pseudo-locations (0 mile markers and named places). A hierarchical approach is used: 1) exact match with the new pseudo-layers, 2) TIGER, 3) more accurate city centerlines, 4) relaxation, and 5) repeat. 200 addresses are tested: Class 1 (exact) - 86% match, 100% accuracy; Class 2 (relaxed type) - 96% accuracy; Class 3 (relaxed type and pre) - 90% accuracy. The results show that misidentifying locations introduces spatial bias, and the authors call for more research into 1) preprocessing attributes to find problems, 2) finding intersection mismatches, 3) better relaxation algorithms, and 4) integration of detailed data sets.
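The stepwise relaxation described in this entry can be sketched as follows. This is an illustrative reading, not the implementation of [75]: the `Street` record, the order of the steps, and the sample reference list are assumptions, and the only rule taken directly from the entry is that the street name itself is never relaxed.

```python
from typing import NamedTuple

class Street(NamedTuple):
    pre: str    # directional prefix, e.g. "N" (empty string if none)
    name: str   # street name; never relaxed
    type: str   # street type, e.g. "ST", "AVE"

# Each step drops one more attribute from the comparison; the name always stays.
RELAXATION_STEPS = [
    ("exact", ("pre", "name", "type")),
    ("relaxed type", ("pre", "name")),
    ("relaxed type and pre", ("name",)),
]

def match_street(query, reference):
    """Return (matched record or None, label of the step that decided)."""
    for label, fields in RELAXATION_STEPS:
        candidates = [
            r for r in reference
            if all(getattr(r, f) == getattr(query, f) for f in fields)
        ]
        if len(candidates) == 1:
            return candidates[0], label
        if len(candidates) > 1:
            # Relaxing further can only add candidates, so stop here.
            return None, "ambiguous at " + label
    return None, "no match"

# Hypothetical reference file.
reference = [
    Street("", "KING", "ST"),
    Street("N", "KING", "ST"),
    Street("", "QUEEN", "AVE"),
]

print(match_street(Street("", "QUEEN", "ST"), reference))
print(match_street(Street("S", "KING", "ST"), reference))
```

Returning the step label with each match keeps a record of how much relaxation was needed, so the accuracy of each class can be reported separately, as the entry does.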
[107] presents an outline of how important GIS are for environmental epidemiological studies, and how geocoding is a crucial component. The authors review TIGER as the most commonly used data source, describe the basic interpolation process, and run an example test on a rural section of North Carolina to determine if people located closer to dump sites are more likely to have evidence of immunosuppression than those further away. A list of addresses within block groups near the dump site was purchased from a commercial vendor, but because of the lack of completeness of the TIGER data, the study used the NC Department of Education's Transportation files (for bus routing) and got a 28% exact match rate, an additional 30% matched to the intersection, and the remaining 42% from the study participants marking on a map where they lived during blood collection. This is consistent with the Carolina Population Center in NC, who found that 20% of rural and 98% of urban addresses are typically matched. The authors found that the biggest problems with geocoding in their study (using the term ``address matching'' synonymously) are 1) incomplete/inaccurate reference DBs (need to update regularly), 2) lack of address standardization (need to standardize), and 3) lack of assignment of numerical addresses in rural areas (need to implement - E911). The authors also discuss the confidentiality issue that arises as high resolution data is used, ways to ensure confidentiality while keeping the accuracy, and other factors that influence exposure, such as more than one residence, work location, and weather and flow factors.
[24] presents early work on the effects of misclassifying people into the wrong enumeration districts through faulty geocoding, which produces inaccurate points that are then used for point in polygon to spatially link to the attributes of the enumeration district (ED) (in this case Townsend scores). This work is emulated again and again in different domains ([66], some others), and appears to be the first. The authors are the first to warn about the error propagation that can occur through the misclassification of people into the wrong groups, and how this will affect the outcomes of the research and needs to be quantified in the results. They give a very nice background about EDs and postcodes in Britain, and how their boundaries do not match and postcodes may fall into multiple EDs. They outline the development of the ED-postcode mapping database, and how these relate to the 100 m grids of the Central Postcode Directory (CPD) with their associated error (bottom left corner) (similar to [81] and [38]). A further description of the ADDRESS-POINT dataset development and why its use is prohibitive (because it is too expensive) is given. The classifications produced by the postcode-to-ED and ADDRESS-POINT-to-ED mappings are compared, and the degree to which the postcode-ED database has improved over the mappings produced by overlaying the 100 m CPD grid is analyzed. Postcodes were assigned to addresses from the ED-postcode database at the University of Manchester, and the ADDRESS-POINT matching was done by a commercial firm. Only 3428 of 3864 addresses could be used, because only those that were in Sheffield and matched by both methods could be employed. Townsend scores were then calculated. A nice discussion of the gross and net errors introduced by misclassification is presented, and it is determined that postcode matching is a great improvement on grid matching, but performs not nearly as well as ADDRESS-POINT.
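The difference between linking a point to an ED by point in polygon and linking it to the nearest ED centroid (the "allocation" and "assignment" of [38]) can be illustrated with a small sketch. The district shapes, the centroid positions, and the function names are made up for illustration and do not come from [24] or [38].

```python
import math

def point_in_polygon(pt, polygon):
    """Ray-casting test: count how many edges a ray to the right of pt crosses."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        if (y0 > y) != (y1 > y):  # edge straddles the horizontal line through pt
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside

def allocate(pt, districts):
    """Point in polygon: the district whose boundary contains the point."""
    for name, (polygon, _centroid) in districts.items():
        if point_in_polygon(pt, polygon):
            return name
    return None

def assign(pt, districts):
    """Nearest centroid: the district whose centroid is closest to the point."""
    return min(districts, key=lambda name: math.dist(pt, districts[name][1]))

# Hypothetical districts: (boundary polygon, population-weighted centroid).
districts = {
    "A": ([(0, 0), (10, 0), (10, 10), (0, 10)], (2, 5)),    # population skewed west
    "B": ([(10, 0), (14, 0), (14, 10), (10, 10)], (12, 5)),
}

postcode_point = (9, 5)
print(allocate(postcode_point, districts))  # A: the point lies inside A's boundary
print(assign(postcode_point, districts))    # B: B's centroid is closer
```

A point near a boundary can therefore be linked to different districts, and hence to different census attributes, depending on which method is used.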
[17] examines the link between SES and susceptibility to the drug resistant strain of Streptococcus pneumoniae. Geocoding is performed to the block level with Census 1990 data in Atlanta using the Census matcher LandView, and achieved a 90% match rate, with the 10% error coming from misspellings and new streets not in the reference files. The authors tested 528 cases and found that the more affluent the neighborhood, the more likely its residents were to be susceptible to drug resistant strains, and postulate that this has to do with access to antibiotics for people with money. No consideration is given to the inaccuracy of geocoding having an effect on study results.
[90] presents the state of data problems in GIS and health for the developed and developing world. The authors give a brief survey of health applications that use GIS, and survey some work that mentions the fact that there is always some error in geospatial data. Cancer research is highlighted because of its need for historical data, and confidentiality is listed as a hindering factor in the release of information. The biggest problem in the developed world is the error in data, while the developing world simply lacks the data. AIDS data is listed as the epitome of missing and inaccurate data. The authors question whether the confidentiality constraint should be lifted to facilitate the collection and dissemination of health data.
[43] examines the loss of subjects that occurs when attempting to geocode residences for a breast cancer study, pointing out the lack of attention this has received. The authors geocode all (11,470) breast cancer diagnoses for Connecticut from 1992-1995 with Maptitude, using the TIGER base (+-100 m - not very good accuracy, with a large effect on small studies). Several match types are defined: ``exact'' matches (name, type, and number within range), ``inexact'' matches (acceptable name and number within range) obtained by relaxing with a hierarchy of phonetic and lexical rules, and non-matches (no name, outside range, ambiguous, missing info). Achieved rates are: 71% exact, 15% relaxed (with most due to discrepant names/types), and 14% non-matches. The study finds that rural addresses were more likely to fail because of PO boxes, and suburban ones to fail because of missing addresses in the reference file (maybe because of the rapid construction between censuses). Moreover, the study finds that the non-matches were associated with 1) characteristics of the patient (race, place of birth), 2) the residence (township characteristics, census variables), and 3) the illness (year of diagnosis), and that studies will over-sample in urban areas and under-sample in rural ones.
[53] discusses the process of obtaining addresses for the people who gave PO Boxes for the California Cancer Registry, and the effect that using zip code centroids has on biasing study results. The study finds that while it is possible to get the addresses from the USPS, they are not likely to improve the geocoding much over the use of zip code centroids. Also, the study finds that serious misclassification occurs when the zip code centroids are used in place of the actual address of the person for associating SES and proximity to possible exposure. Only 47% of the addresses were received from the USPS, of which there was a 69% match rate, improving by 27% manually. The authors found that 25% of people were placed greater than 4 miles from the zip code centroid, resulting in 81% case misclassification. Also acknowledged is that other sources of spatial error in epidemiological studies include address matching and spatial inaccuracy of the reference layer.
[69] attempts to evaluate commercial geocoding firms based on 1) accuracy, 2) cost, 3) timeliness, and 4) customer service. The authors sent a test file of 75 addresses with 50 errors (out of address range, abbreviations, misspellings, and incorrect zip codes) to several firms and evaluated the ``real world'' accuracy of the best one (based on placing the point into the correct census block). The study found that all firms used the latest street data available, several used more than one source, and the results varied widely on match rates (44-84%) on ``incorrect addresses'', and on accuracy (65-100%) on correct ones. The best company achieved 95% accuracy, and the cost for 1 million addresses was $9114. The work concludes with a note that all projects using geocoded data need to evaluate and report the accuracy of the points. The accuracy measure used in the work is not very fine grained.
[93] presents a detailed analysis of the accuracy of linear interpolation geocoding in comparison to 1) parcel centroids, 2) parcel assignment, and 3) census boundary assignment. He found that 5-7.5% of 21,890 addresses were assigned to the incorrect census boundary, and that 50% were assigned to the wrong property. He cites the declining price of geocoding: 330,000 addresses for $500. A brief history of the geocoding data sources (DIME -> TIGER) and the ADDRESS-POINT db is given, as well as an outline of the geocoding process: linear interpolation, with an offset and inset. A number of potential problems are identified: 1) out of date street reference files, 2) abbreviations, 3) name variations, 4) duplicate addresses, 5) non-existent addresses, 6) line simplification, 7) noise in the address data, 8) non-address locations, 9) geocoding imprecision, and 10) ambiguity. He found that only 10% were in the correct parcel, 26% in another wrong one - 72% error, and accepts that TIGER was not created for the cadastral level, but for ``relatively large scale mapping capability''. Further, he calculated the distance from the parcel centroids (this is ok in his area because the parcels are small, 434 m squared, and the centroid will be near the house location), and found a mean distance of 31 m (using the trimmed mean from [38] to cut off outliers). A very good discussion is given about the effect of the offset and inset, which determines, based on the distribution of angles of offset, that globally varying the offset size alone can improve the accuracy only slightly. When comparing the census accuracy, he finds that addresses on the border are misclassified, and that some notion of proximity to the border needs to be accounted for, as well as border line simplification.
He concludes with a statement that old results based on geocoded points need to be revisited to ensure their accuracy, and that rigid data entry procedures will help the accuracy of data.
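As a rough sketch of linear interpolation with an offset and inset, the function below places a house number along a straight street segment in proportion to its position in the address range, pulls it in from the segment ends by the inset, and pushes it perpendicular to the centerline by the offset. It assumes planar coordinates and a single straight segment; the parameter names and sample values are illustrative and are not taken from [93] or from any particular geocoding package.

```python
import math

def interpolate_address(number, low, high, start, end,
                        offset=0.0, inset=0.0, side=1):
    """Estimate a point for `number` on the segment start -> end.

    low/high : address range at the start and end of the segment
    offset   : perpendicular distance from the centerline (the dropback)
    inset    : distance kept clear at each end of the segment
    side     : +1 places the point left of the direction of travel, -1 right
    """
    (x1, y1), (x2, y2) = start, end
    length = math.hypot(x2 - x1, y2 - y1)
    if length == 0:
        raise ValueError("segment has zero length")
    if not min(low, high) <= number <= max(low, high):
        raise ValueError("house number is outside the address range")

    # Fraction of the way through the address range (midpoint if the range is one number).
    frac = 0.5 if high == low else (number - low) / (high - low)

    # The inset shrinks the usable part of the segment from both ends.
    inset = min(inset, length / 2)
    dist = inset + frac * (length - 2 * inset)

    # Unit vector along the segment, and its left-hand normal (-uy, ux).
    ux, uy = (x2 - x1) / length, (y2 - y1) / length
    px, py = x1 + ux * dist, y1 + uy * dist
    return (px - side * offset * uy, py + side * offset * ux)

# Hypothetical 100 m segment with range 100-198, 10 m inset, 15 m offset.
print(interpolate_address(149, 100, 198, (0, 0), (100, 0), offset=15, inset=10))
```

Because the offset and inset are single global values here, the sketch also shows why tuning them can only help so much: real dropbacks vary from parcel to parcel.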
[28] provides an early discussion that examines the error of the address range method of geocoding versus the parcel centroid. It states that the georeferencing process contributes uncertainty, as opposed to [87], who defines it as a source error (in their case the geocoded points were considered a source, and not a procedure). The authors provide a description of the address range method, termed the ``street centerline model'', arguing that using this model introduces two assumptions that impact the locational accuracy: 1) ``parcel homogeneity'', referring to all addresses being equally distributed across the segment - not true in rural areas where parcels are large and divided for historical reasons, nor in urban areas where an apartment complex might take up most of the addresses on a segment but only a small piece of the street segment, and 2) that the street centerline method and street networks can be used to accurately and consistently assess residential locations - which is claimed to be incorrect because a dropback is necessary, and using a constant value does not represent reality, where the dropbacks are rarely uniform. Additionally, the authors identify that the address range method will have a higher match rate than parcels with a single address, because the ranges allow for the existence of false addresses which the parcels (with a single address) do not. The matching algorithm presented is quite naive, in that it tokenizes and matches pieces, but non-matches need to be corrected manually, without the help of probabilistic methods. This led to discarding the 45% of their data that could not be matched by both methods and so could not be compared (the concordance). Matching errors are classified as data quality (misspelling, abbreviation) or data standardization (PO Boxes).
The effect of the differences between the geocoding strategies is shown to be significant: when the points are used as the spatial basis for exposure modeling, the results change depending on which method is used. The authors note that the hard part of using parcel data is getting it, because the data does not exist or is not accessible.
[29] analyzes the relationship between the characteristics of a neighborhood and the likelihood of incidence of coronary heart disease. The work makes summary scores of the SES from the block groups, identified by geocoding the people to their census blocks, and rates the neighborhoods according to a variety of factors. The authors find that living in disadvantaged neighborhoods increases the likelihood of disease incidence. However, no reference to the accuracy of the geocoding is made, and the results (based on statistics) can be altered if the addresses are placed in the wrong groups, making this an example of what not to do (failing to account for geocoding errors).
[71] investigates how well the census variables at the block level can be used to represent individual characteristics in North Carolina. The study used 71,682 women from a mammogram screening program with self reported race and educational status, and compared these to what they would be classified as using their geocoded census block values, as well as the urban/rural values from geocoding to the zip code level. The block level geocoding was dismal: 45% with TIGER alone, and 55% with the addition of North Carolina's Transportation Information System. Zip code geocoding worked for 95% (69% in urban, 25% in rural). The results found that minorities will be highly misclassified using census values, and that rural/urban was not a confounder for race, but was for education. The study also tested whether an analysis based on estimated data would be affected, and it was. It is mentioned that bias could be introduced because the rural match rate is so much lower than the urban (cartographic confounding, [88]).
[68] argues that postal zip codes and census zip code tabulation areas should be used very carefully when linking with SES data from the census. The main problems outlined are: 1) the spatiotemporal variability of each, which needs to be accounted for when using data from one year to collect SES data from the census of another, and 2) that they are not the same thing; in particular, they have different spatial boundaries.
[7] attempts to determine 1) why geocoding to the tract level fails, and 2) how much effort it takes to improve the results. Geocoding is defined as ``the act of assigning additional geographic information to a case report''. A nice discussion is presented about the choice of scale depending on 1) type of research question, 2) data limitations, and 3) confidentiality concerns, and why census tracts are good choices for epidemiological studies because they 1) identify neighborhood patterns, 2) have uniform population (less bias), and 3) are confidential, but are bad because studies can be biased toward high density areas, since geocoding to the census tract works less well in rural areas (rural routes and PO boxes). Two studies were performed. The first took 500 random cases and geocoded them, with failures going into four categories: 1) previous match (assigned to the previous census tract match), 2) complete and correct address, but missing from the reference db (assigned manually from an address-census crosswalk file), 3) minor errors - misspellings, transpositions (corrected and re-run), and 4) incomplete and unable to be handled as in (2) (additional sources consulted - 20-25/hour). The second phase geocoded 85,863 previously non-geocodable cases by 1) cleaning out extraneous characters and white spaces, 2) standardizing common abbreviations, and 3) matching zip+streetname+range, or out of range when the whole street fell in a single tract. In the first test, an 80% match was achieved automatically, and 19% by hand (totaling 99%). The second got 18% of the previously unmatched. The results found that 1) geocoding can be improved with very little work, 2) it is dependent on the reference file (multiple should be used), 3) systematic problems can be corrected in batch (the same point the thesis author made), 4) it typically fails because of minor errors, and 5) relaxation can be used to improve results, but may cause problems for streets that are very similar.
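The first two cleaning steps of the second phase (stripping extraneous characters and white space, then standardizing common abbreviations) can be sketched as below. The abbreviation table and function name are illustrative assumptions, not the rules used in [7], and the token-by-token replacement is deliberately naive: it would, for example, also shorten a street that is literally named "Street".

```python
import re

# Small illustrative table of street type abbreviations (an assumption).
ABBREVIATIONS = {
    "STREET": "ST",
    "AVENUE": "AVE",
    "ROAD": "RD",
    "DRIVE": "DR",
    "BOULEVARD": "BLVD",
    "APARTMENT": "APT",
}

def clean_address(raw):
    """Uppercase, drop punctuation, collapse white space, standardize types."""
    text = raw.upper()
    text = re.sub(r"[^A-Z0-9\s]", " ", text)  # extraneous characters become spaces
    tokens = text.split()                     # split() also collapses white space
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)

print(clean_address("  123  Main   Street. "))
print(clean_address("45 oak avenue, apartment #2"))
```

Running every record through the same normalization before matching is what lets systematic problems be corrected in batch rather than one address at a time.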
[32] performs a study of the error associated with geocoding rural areas, in particular farms, focusing on the effects for veterinary epidemiology. For discrete (single parcel) farms, the study investigates 1) the parcel centroid (determined through Green's theorem), 2) geocoding based on a postcode random point, postcode sector random point, code point, and address point, 3) a parish code random point, and 4) referencing the building from imagery. A circle around the chosen point is used to determine the amount of actual farmland that is included. Evaluations show that the centroid performs the best (reasons are given as to why using a circle is an appropriate method, because of the relationships between the lengths of the perimeter). For complex (multi-parcel) farms, the study investigates 1) the mean centroid of all parcels, and 2) the weighted mean based on area. The results find that the centroid works the best at capturing the most area of the actual farm, but the claim is made that building georeferencing is the most ``practical'' method, which is backed by a few reasons, mainly that it is the center of activity of the animals. The poor share of actual farmland captured by the building method shows the obvious fact that farmhouses are not typically in the center of the parcels, which is not mentioned. Additionally, solid evidence is provided that postal geocoding in rural areas is a big problem, and is especially compounded when the farm consists of multiple parcels, which can also be a problem in urban areas. The authors conclude with some interesting comments about how random points in the postcode are ok for low resolution maps, and that there is a need to assess the quality of the address data before using it in studies (similar to [69]).
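For reference, the centroid of a parcel polygon that follows from Green's theorem is the standard shoelace-style formula sketched below. This is the textbook calculation for a simple (non-self-intersecting) polygon in planar coordinates, and the L-shaped parcel is a made-up example rather than data from [32].

```python
def polygon_centroid(vertices):
    """Return ((cx, cy), area) for a simple polygon given as a list of (x, y).

    Green's theorem turns the area integrals into sums over the boundary edges.
    Vertices may be listed clockwise or counterclockwise.
    """
    signed_area = cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        signed_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    signed_area *= 0.5
    if signed_area == 0:
        raise ValueError("polygon has zero area")
    return (cx / (6 * signed_area), cy / (6 * signed_area)), abs(signed_area)

# Hypothetical L-shaped parcel.
parcel = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
print(polygon_centroid(parcel))
```

For strongly concave parcels the centroid can fall outside the polygon, which is one reason a centroid is only an approximation of where buildings actually are.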
[6] presents the first study of the positional accuracy of geocoding for use in epidemiological studies. The authors use a very small data set (n=200) and geocode addresses (mostly historical) in Western New York State. Geocoding is defined as the ``process whereby the relative positions of addresses are linked to a reference theme, which is a database that contains both address information and locational information''. Repeating from [30] and [69], it is stated that the accuracy of geocoding will depend on both the match rate and the accuracy of the process of geocoding. The authors tested 1) the accuracy of the historical addresses, 2) whether the inaccuracy of geocoding will produce different results, and 3) whether there are differences in the accuracy in rural vs. urban areas, using ESRI and GDT (TIGER base) to geocode. The offset (Euclidean distance) from GPS coordinates taken from the street in front of the residence was measured, and it was found that historical addresses were typically more accurate (30-40 m) than current ones (38-58 m), but the ones that were erroneous had the greatest error of them all (>1000 m), while the average error was <100 m for the majority of all addresses, and urban was better (32 m) than rural (52 m). The sources of error identified were the reference files (TIGER) and the interpolation process - noting that long streets had more error - as well as possible error from the GPS measurements. The authors conclude with a nice discussion of how some studies will not be affected by this possible error, but others will, and note that future studies will need to account for the possible error in their conclusions.
[15] presents a study of the positional accuracy of geocoding regarding 1) the evaluation of positional error - the Euclidean distance from ground truth as picked from images, 2) how the error varies between different classes of population density - urban, suburban, and rural, and 3) how using parcel data (centroids) improves the results. This paper has excellent references. It provides a nice collection of studies that use geocoded data, and describes the geocoding interpolation process very well, along with how the possible sources of error can affect the outcomes of study results. The authors define the match rate, and give several references that have studied it as well as how higher match rates can be achieved (with minimal effort) at the expense of greatly lowering the positional accuracy (from [78]), noting that the accuracy will depend on the actual address data as well as the reference file, and that the level of misclassification (into whatever) will depend on the areal unit chosen, and they give the few studies that have looked at positional error. Land is classified into urban, rural, and suburban based on the population density, and a random sample of 1000 addresses from each (3000 total) was taken from the whole 215,007 addresses in the population, of which 81% were successfully geocoded using MapMaker and GDT. Each house was marked on imagery and the marks were QA/QC'ed by multiple people, with the average difference between people being 3.3 m. The study found the largest problems in marking houses arose when the houses were close together, and from trees and garages. The study used the default setting for corner and lot insets, and found, through testing 5 m iterations, that tuning them did not really have a large effect (increased accuracy 7 m), as did [93], and that the direction error was uniformly distributed.
The authors found large differences between the accuracy of the three regions. The distances within which 95% of addresses fell, and the mean errors, were: rural 2872 m and 614 m, suburban 421 m and 143 m, urban 152 m and 58 m, and the largest errors were due to inaccurate zip code boundaries. The use of parcels greatly improved the accuracy (especially in the rural areas), with 95% within 195 m, and: rural 55 m, urban 21 m, and suburban 39 m. Sources of error in the interpolation methods are identified: 1) incorrect address ranges, 2) longer street segments - higher error, 3) the non-homogeneity of parcel distribution, and 4) non-homogeneity of offsets (2, 3, and 4 are more likely in rural areas). Also discussed are the ``spatial non-stationarity process'', evident when the mean and variance differ by location (cartographic confounding), and how this may bias studies; the possibility of systematic error in certain locations (when all addresses on a street are off by the same amount), which is not randomly dispersed (the thesis author copied this); and the fact that the accuracy of the TIGER data is not the same across all regions (again cartographic confounding). Like [93], it was found that optimizing the parameters has little effect on the accuracy. Also, the parcel centroids were more accurate in urban areas than in rural ones, but the addresses on parcels are non-standardized, leading to lower match rates, and using E911 would be good except that each county has its own system (as opposed to the [74] paper). The authors conclude by echoing [69] that all studies using geocoded data need to assess and report on the accuracy of their underlying geocoded data.
[83] reports on the multi-pronged methods that the authors used, creating an iterative geocoding process, to increase the match rate in the geocoding of 14,804 people in Wisconsin. The authors managed to geocode 97% to the addresses and the remaining 3% to the zip code centroid. The steps used were 1) standardize the address to USPS, 2) geocode with two versions of TIGER, 3) use internet searches to get more complete address info, 4) use web maps to geocode, and 5) re-contact the study participants. The study found that the average cost was $1 per address, with 1/2 the cost for the programming, 1/4 for the updating of incomplete/inaccurate addresses, and 1/4 for re-contacting people. Geocoding is defined as 1) matching the address to the correct segment in the reference files, and 2) interpolation - it is claimed that address matching is the same term. The main problems in geocoding were identified as: 1) misspelling, 2) changing addresses over time, 3) PO Boxes, and 4) errors in the reference files. The authors found that using two versions of TIGER (1995 and 2000) helped with historical streets, and that the 1995 version was more accurate for rural addresses. The study allowed for an 80% match in terms of spelling and sensitivity in the ESRI software used (probability based, with weighting of the address components), but recognized that this will lower the accuracy of the results. A hierarchy was developed for when the address matched differently in different sources - 1) TIGER 2000, 2) TIGER 1995, 3) intersection, and 4) zip code centroid. The census designations of urban/rural were used, and the study found that rural addresses were more likely not to match. A total match rate of 76% was achieved on the already standardized addresses (69% from the 2000 version and 7% from 1995); using online data improved 80% of the unmatchable addresses, and re-contacting yielded sufficient information on 45% of the 597 PO Boxes.
The study reports that the intersection geocoding was 80% within a quarter mile, but doesn't mention how it was ground-truthed. The authors point out that possible sources of error are: 1) the positional accuracy of TIGER, 2) the point-in-polygon assignment of census tract, and 3) the temporal nature of addresses. No discussion of the positional error is offered, only an iterative way of increasing the match rate.
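The source hierarchy described for [83] (TIGER 2000, then TIGER 1995, then intersection, then zip code centroid) can be read as a simple fallback chain. The following is a minimal sketch of that idea only, not the authors' implementation; the function name `pick_match` and the source labels are hypothetical.

```python
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]

# Priority order follows the hierarchy described in the text.
# The label strings themselves are illustrative assumptions.
SOURCE_PRIORITY = ["tiger_2000", "tiger_1995", "intersection", "zip_centroid"]


def pick_match(candidates: Dict[str, Optional[Point]]) -> Optional[Tuple[str, Point]]:
    """Return (source, point) from the highest-priority source that produced a match.

    `candidates` maps a source label to the point that source returned,
    or to None (or is missing the key) when the source did not match.
    """
    for source in SOURCE_PRIORITY:
        point = candidates.get(source)
        if point is not None:
            return source, point
    return None  # nothing matched; the record would need manual follow-up
```

For example, when only the older street file and the zip centroid produce a result, `pick_match` returns the TIGER 1995 point because it ranks higher in the hierarchy.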
[27] outline a framework for determining the accuracy of geocoded points based on the level of information available, called the Geocoding Quality Index (GQI), which is an ``uncertainty driven geocoding tool'' developed in response to the difficulties of geocoding in Brazil. The authors begin by detailing the difficulties of geocoding in developing countries, especially with the existence of slums that have chaotic occupation and no address markings on dwellings, stating that this is more prevalent in poor areas (cartographic confounding). A very good history of addressing systems and standardization efforts (US and EU, realizing that this is critical to accurate geocoding - [111]) is given, listing the different types of addressing systems ([34]) that are used throughout the world: numbering, named, random numbers and no street names (Asia), parcel numbering, and the metric system with the odd/even rule - which is not distance oriented, but requires a starting point for streets and some official coordination (not possible in slums). A generalization of commonalities between schemes is presented as the existence of 1) streets, building numbering, and cities, and 2) indirect references (landmarks). The authors are quick to point out that the level of accuracy required will be application dependent, and that the GQI will serve to provide a level of uncertainty for geocoded locations that these applications can use to determine if the level is appropriate. The geocoding process is defined as three operations: 1) parsing - text/string algorithms (needs a placename database to correct to), 2) matching - find the most precise match, which requires additional databases (address infrastructure - addresses->points and street centerlines with ranges - and things to resolve ambiguities - borders, neighborhoods, admin districts, gazetteers (``extend common notion of addressing'')), and 3) locating - simple->just return the previously geocoded match point, or complex->the point has to be generated.
The argument is made that the process should start at the most precise level and work down, if data is not available, through an accuracy hierarchy: 1) individual point - direct match to geocoded address or building name, 2) the assignment of an address to another's direct match, 3) centerline interpolation - only for metric, 4) street (thoroughfare) interpolation - when the origin of the street is not known, 5) reference area centroid - large park, complex, 6) street crossings, 7) neighborhood centroid, 8) postal area, 9) municipality. The GQI is defined as the ratio of the area that the geocoding can be narrowed down to over the area of the largest unit (the worst case, the municipality), ranging from 0 (best - point match) to 1 (worst - municipality match). The choice of largest area seems arbitrary, and the measure of uncertainty is a ratio of potential areas where the point could reside, not spatial accuracy. The authors could have gone further to describe the constraint of where the uncertainty resides in the area.
For area based geocoding, the area of the region is used, but for all others a neat trick is used: the uncertainty is described as the area of a circle whose radius is defined by the possible area in which the point could reside, in several cases: 1) line based - length of street X ``standard width'', 2) numerical approximation - (given number - found number) squared - this has big problems if the homogeneity constraint is not satisfied or for long streets, 3) street crossings - standard width squared (size of intersection), 4) 2 street crossings (this would be the ambiguous case, but is not mentioned - this would be another source of uncertainty - someone referred to this previously) - line rule, 5) 3+ street crossings (also ambiguous) - convex hull, 6) regular numbering - line rule with half standard width because of the half street offset (without noticing that it is also off the street an additional distance), 7) irregular numbering - line rule, but all streets which could possibly include the address have to be included.
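As a rough illustration of the GQI ratio summarized for [27], the sketch below computes the uncertainty area for two of the cases listed (the line rule and a single street crossing) and divides it by the municipality area. It is a reading of the description above, not the authors' code; the function names, the clamping to [0, 1], and the sample figures are assumptions.

```python
def line_rule_area(street_length_m: float, standard_width_m: float) -> float:
    """Uncertainty area for a line-based match: street length x standard width."""
    return street_length_m * standard_width_m


def crossing_area(standard_width_m: float) -> float:
    """Uncertainty area for a single street crossing: standard width squared."""
    return standard_width_m ** 2


def gqi(uncertainty_area_m2: float, municipality_area_m2: float) -> float:
    """Ratio of the uncertainty area to the worst-case (municipality) area.

    0 corresponds to a point match, 1 to a municipality-level match.
    Clamping to [0, 1] is an assumption made for this sketch.
    """
    if municipality_area_m2 <= 0:
        raise ValueError("municipality area must be positive")
    ratio = uncertainty_area_m2 / municipality_area_m2
    return min(max(ratio, 0.0), 1.0)


# Hypothetical example: a 500 m street with a 20 m standard width
# inside a municipality of 1 km^2.
example = gqi(line_rule_area(500.0, 20.0), 1_000_000.0)
```

Under these made-up figures the uncertainty area is 10,000 square meters, so the index is about 0.01, close to the precise end of the scale.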
[60] present the use of GIS tools to reconstruct the historical exposure to pesticides based on geocoded addresses of individuals with breast cancer in Cape Cod. The authors develop a spatial proximity tool to relate the women's addresses with environmental data about the spraying areas for pesticides. Analyzed are: 1) spray drift, 2) deposition location, 3) distance between residence and sprayed area, 4) the size of the area, and 5) the wind direction. The study reconstructs the ``dose'' in terms of space, time, and intensity to determine a relative intensity. The authors have to deal with data and geocoding problems, mainly due to the disparate data layers that were used in each of the different cities to create the parcel maps (different sources, resolutions, and accuracies) which were used to geocode to the parcel center. The main problem was that the pesticide maps didn't match up directly, so the points were placed by hand onto the rooftops (only 2100 points) from aerial photography (not scalable). It is pointed out that these types of data source errors will be common to all studies and that further research is needed to fully understand and model them. The exposure for non-geocodable addresses that had street names was determined by calculating the exposure for each parcel on the block and taking the mean, called ``imputing''. This paper uses a method from [110].
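The ``imputing'' step described for [60] amounts to averaging the exposure of every parcel on the block. A minimal sketch, assuming the per-parcel exposures have already been computed; the function name and the handling of an empty block are assumptions, not details from the paper.

```python
from statistics import mean
from typing import Optional, Sequence


def impute_exposure(parcel_exposures: Sequence[float]) -> Optional[float]:
    """Impute exposure for a non-geocodable address as the block mean.

    `parcel_exposures` holds the modeled exposure of each parcel on the block.
    Returns None when the block has no parcels (an assumption of this sketch).
    """
    if not parcel_exposures:
        return None
    return mean(parcel_exposures)
```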
[11] provides more detail about [60], presenting the data analysis used to determine whether historical pesticide exposure was statistically related to the incidence of breast cancer in Cape Cod. The authors detail the exposure model used, based on wind direction, amount sprayed, and distance, including the problems caused by missing pesticide data. Addresses were geocoded to the rooftops of houses (by hand - not scalable) when possible; to the nearest rooftop, nearest parcel center, or center of street (in that order) when the house is referenced by a landmark or intersection; and to the centroid of the nearest cluster of houses when a house could not be identified in the aerial photos. The study achieved an 83% success rate, 90% to the roof center, and found that more recent addresses were more likely to geocode. See the other paper for more details ([12]).
[94] attempts to determine what an acceptable ``hit rate'' is for geocoding with regard to generating thematic crime maps, to model the impact of inaccuracies in law enforcement geocoding. The author performs a Monte Carlo simulation of a declining hit rate combined with statistical analysis of the aggregated outcomes. He states that for crime mapping, a high hit rate is better than high precision, and gives several reasons why the hit rates are low for crime: 1) there are separate databases for incidents and crimes, 2) the reported address is not updated with the actual address, and 3) there is no verification done on the input of data. He also gives common errors: 1) misspelling, 2) incorrect prefix, 3) incorrect abbreviations, 4) incorrect types, 5) impossible addresses, 6) using named places, 7) omitting any address at all, 8) confusing address and unit number, and 9) extraneous text (these are cited from Harries 1999 book - Mapping Crime). The simulation started with the creation of thematic maps with 100% of the data, removed 1% and checked for statistical differences, then repeated this to account for random point selection. The amount of data removed was increased and the statistical differences between maps were displayed as a frequency distribution. The author found that an 85% hit rate is acceptable 95% of the time, across a range of areas, admitting that while the tests assumed an equal probability of any point being missing, in real life geocoding errors are not random; instead one area may have all the points missing because of spatial characteristics (new home developments, landmarks clustering in centers of towns) - cartographic confounding. The work concludes by stating that address cleaning will greatly improve the hit rate.
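The simulation described for [94] (build the map from all points, randomly remove a share, compare, repeat) can be sketched as follows. The summary above does not name the statistical test used, so this sketch substitutes a simple stand-in measure, the largest absolute change in any area's share of incidents; that measure, the function names, and the sample data are all assumptions.

```python
import random
from collections import Counter
from typing import Dict, List


def area_shares(points: List[str]) -> Dict[str, float]:
    """Share of incidents per area; each point is the label of its area."""
    counts = Counter(points)
    total = sum(counts.values())
    return {area: n / total for area, n in counts.items()}


def simulate_hit_rate(points: List[str], hit_rate: float,
                      trials: int = 200, seed: int = 0) -> float:
    """Mean (over trials) of the largest change in any area's share
    when only `hit_rate` of the points are kept, chosen uniformly at random."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    full = area_shares(points)
    keep = max(1, round(len(points) * hit_rate))
    diffs = []
    for _ in range(trials):
        sample = rng.sample(points, keep)
        shares = area_shares(sample)
        diffs.append(max(abs(full[a] - shares.get(a, 0.0)) for a in full))
    return sum(diffs) / trials


# Hypothetical data: 60 incidents in area A, 40 in area B.
incidents = ["A"] * 60 + ["B"] * 40
drift_at_85 = simulate_hit_rate(incidents, 0.85)
```

Like the original tests, this removes points uniformly at random, so it does not capture the spatially clustered failures (cartographic confounding) noted above.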
[116] compares the strengths and weaknesses of three geocoding packages: 1) ESRI, 2) AutoMatch, and 3) ZP4. It finds that each will give different results. The authors break the geocoding process down into 3 main components (from Cochinwala 2003 - Record Matching past present and future - GET): 1) preprocessing (parsing, standardization, and correction), 2) matching against reference data, and 3) interpolation. They go through the technical portions of the three packages: 1) ESRI uses deterministic scoring (the new version is a bit more probabilistic) that allows for sensitivity adjustments (increased match rate, more errors) and weighting of attributes, and has no census data, which needs to be spatially linked, 2) AutoMatch uses probabilistic record linkage ([56]), with dynamic weighting based on probabilities, which requires programming of the blocking strategies (similar to [91]), and 3) ZP4 standardizes and corrects addresses, but only geocodes to the zip code centroid, with census data. Resulting match rates: 82% ZP4, 86% AutoMatch, 88% ESRI. The work measures accuracy as matching to the same census block (not a very good measure), and finds that even though the match rates were high, the products disagreed on almost 30% of the addresses, that zip codes were the major source of this (because they change frequently over time), and that rural areas did the worst. The authors conclude with the recommendation that addresses be standardized with USPS first, and that accurate street reference files then be used for matching and interpolation.
In their review on the usage of GIS in Epidemiology, [87] state that geocoding leads to source errors, which is true in the sense that the underlying reference data may be incorrect. However, the authors fail to realize that it is also susceptible to processing errors in the case of interpolation and drop back miscalculations, as well as address matching errors ([28]). The work lists the early research that analyzed spatial accuracy (i.e. [69], detailing the variation in commercial geocoding, and [6], who found that addresses were usually pretty close, but that rural areas were usually worse - this is related to ``cartographic confounding'' [88], and should be noted).
[98] examines the utility and positional accuracy of using a reverse phone book directory (RD) to enhance the geocoding of self reported (SR) addresses. The authors compare 1) the accuracy of the self reported address data, 2) the socioeconomic differences in the people who are in the RD, 3) the proportion of people only geocoded with the RD, 4) the accuracy between the SR and RD geocoding, and 5) how well the RD works for SR addresses that wouldn't geocode. The study randomly dialed and reached 2756 people, and could use 2636. Reverse information about addresses was gathered from white and yellow pages, and SR addresses were collected by asking the people 1) the name of their street, and 2) the address range, and geocoded to the nearest block level with SAS and TIGER - SAS gave lots of false positives because of loose constraints, which had to be reviewed. 64% of people were found in the RD, and 51% of SR addresses geocoded, of which 79% were in the RD. The biggest problems with the RD were 1) out of range addresses, 2) missing numbers, and 3) invalid prefixes, all of which were more likely in the rural areas, which also more often gave rural routes, PO boxes, and highway addresses. 72% of people were able to be geocoded with info from both the RD and SR. The authors found that 80% of 804 addresses geocoded by both SR and RD were within the same census block, and 89% within the same tract, that only 63% of people were in the RD, and that these were not randomly distributed; instead, demographics affect who is in the RD. Overall the study found that the quality of RD addresses was better than SR, and that the combination of the two gave the highest match rate. The positional accuracy was stated as being in the same block/tract (which is not the best measure), and it is even stated that 17% are on the border, which can introduce severe bias.
[10] evaluates the accuracy of using the postal code (the smallest delivery unit - block, single building, or large volume mail receiver - very similar to the UK - these are linked via a national file to geographic coordinates) as a proxy for the residential address in health studies in Canada. The authors start with a good section on studies that use postal code locations, as well as some studies that identify the error of doing so. The study only considers urban areas, excludes rural addresses and PO boxes, and uses the Canadian equivalent of TIGER for street interpolation. The study was of 2947 addresses; the address errors were corrected by hand (postal code not matching address, wrong street names, incomplete address, spelling errors), and the study was able to match 2727/2947, achieving a 97% match rate with ESRI, with the rest missing from the reference file. The postal code was referenced from Statistics Canada's Postal Conversion File in MS Access. The Euclidean and Manhattan (city block) distances between addresses were computed, and the study found that 88% of the postal code points are within 1 block (200 m), while some were very far off (errors in the reference db or the geocoding process). A discussion is offered about how different studies require different levels of accuracy, and the authors offer potential limitations: 1) only urban areas were considered, 2) the geocoding may not have been accurate, and 3) the addresses were cleaned and may over-perform actual ones.
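The two distance measures used in [10] differ only in how the coordinate offsets are combined. A minimal sketch, assuming both points are already in a projected coordinate system with units of meters; the function names are illustrative, and the 200 m default is the one-block threshold quoted above.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (x, y) in meters, projected coordinates assumed


def euclidean(p: Point, q: Point) -> float:
    """Straight-line distance between two points."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def manhattan(p: Point, q: Point) -> float:
    """City-block distance: sum of the absolute offsets along each axis."""
    return abs(q[0] - p[0]) + abs(q[1] - p[1])


def within_one_block(p: Point, q: Point, block_m: float = 200.0) -> bool:
    """True when the straight-line error is no more than one block."""
    return euclidean(p, q) <= block_m
```

For an address point at (0, 0) and a postal code point at (30, 40), the Euclidean distance is 50 m and the Manhattan distance is 70 m, so the city-block measure is never smaller than the straight-line one.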
[59] presents a detailed analysis of the sources of uncertainty involved in geocoding through the comparison of three geocoding methods (ArcGIS, TeleAtlas, and LocMatch - their own, which is simply an interpolator). The geocoding algorithm is defined as 1) searching the reference database for a match (if it was already geocoded), and 2) interpolating if no match is found. The authors state that there are two views of geocoding: 1) positioning a real-world location in a database, and 2) locating a database address in the real world, of which the study focuses on the first because spatial databases can support spatial queries and geometric and topological computations. The work defines a direct method of geocoding through GPS (but too expensive), and also through algorithms. These algorithms require a reference database (of geocoded features - points, lines, polygons), with attribute data that can be matched to the query address. The authors claim that interpolation is only appropriate for centerline geocoding because intersections already have the point (references [75]), and polygons have the centroid (references [93]). The study naively defines the ``address-matching component'' as ``where postal and ZIP codes are used to geocode addresses not only onto correct streets but also in correct areas'', and cites [38] and [81]. The authors repeat that the sources of geocoding uncertainty come from 1) the reference data (completeness, correctness, consistency, currency, spatial accuracy) and 2) the interpolation techniques (assumptions, method) ([28]). A quick survey of accuracy studies is presented, along with how this study is different because 1) it looks at the uncertainties of different methods vs. GPS data, and 2) it looks at the uncertainty of the reference data. A very good description of the different classes of TIGER records is presented (R1 - normal, R2 - polyline, R6 - out-of-sequence addresses).
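The two-step algorithm defined in [59] (return a previously geocoded point if one exists, otherwise interpolate along the street segment) can be sketched as below. This is a generic linear address-range interpolation, not the LocMatch code; the names, the dictionary-based reference lookup, and the clamping of out-of-range numbers to the nearest endpoint are assumptions.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]
# A segment: (low address, high address, start point, end point).
Segment = Tuple[int, int, Point, Point]


def interpolate_address(number: int, low: int, high: int,
                        start: Point, end: Point) -> Point:
    """Place a house number proportionally between the segment endpoints."""
    if high == low:
        return start  # degenerate range: nothing to interpolate
    t = (number - low) / (high - low)
    t = min(max(t, 0.0), 1.0)  # assumption: clamp out-of-range numbers
    return (start[0] + t * (end[0] - start[0]),
            start[1] + t * (end[1] - start[1]))


def geocode(address_key: str, number: int,
            reference_points: Dict[str, Point], segment: Segment) -> Point:
    """Step 1: exact lookup in the reference data. Step 2: interpolate."""
    if address_key in reference_points:
        return reference_points[address_key]
    low, high, start, end = segment
    return interpolate_address(number, low, high, start, end)
```

With an address range of 100 to 200 on a segment running from (0, 0) to (10, 0), house number 150 lands at the midpoint (5, 0).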
The address range extent is defined as the start-end addresses, and the authors develop several overlap schemes to address gap situations (complete, partial, no), and different extents on both sides of the street. The LocMatch algorithm is presented (which is essentially just the interpolation and handles polylines without any probabilistic address matching; 4 cases: is endpoint - return it, on normal street - interpolate, on polyline street - interpolate, out of sequence - assign to closest endpoint), and the authors explain how to use ESRI (standardization, scoring) and TeleAtlas (no scoring, CASS - goes to zip centroid if no address match). 132 addresses with no centerline or end offset are geocoded, and compared to the GPS coordinates (side of road, in front of house, offset to centerline). A very nice discussion of GPS error is presented, showing 99% confidence at 45 m. The Euclidean distance is then calculated, and the match rate is defined in an interesting notation. Analysis of Variance was used to show that the results from the