Smart Address


SMARTaddress

SMARTaddress is an integrated application for cleansing Romanian address data. The individual functions are:

  • Address cleansing: identifying the individual address components (county, locality, street type, street name, etc.).
  • Address standardisation: correcting any mis-spellings, adding diacritics, ensuring conformance to standard abbreviations, taking account of name changes, etc.
  • Postcode attribution: assigning the correct 6-digit postcode from Poşta Română.
  • Geocoding (optional): assigning coordinates to the address.

Address cleansing

The individual address components are –

  • County
  • Locality (or localities)
  • Street type
  • Street name
  • Premises number (Nr., Bl., Sc., Et., Ap.)

In many databases these components are contained in just one field or, even when in separate fields, they are often in a different (and variable) sequence. Additionally, the 'raw' databases often contain a variety of abbreviations and spelling errors.

So, the first step in address cleansing is to parse the addresses (analiza sintaxei) to separate out the individual components and to assign them to their own individual fields (county, locality, street type, street name, etc).
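The parsing step can be illustrated with a minimal sketch. It is not the SMARTaddress parser: the function name `parse_address`, the output field names and the small street-type table are assumptions made for illustration, and the sketch only handles a street followed by labelled premises components.

```python
import re

# Hypothetical mapping from premises labels to output field names.
PREMISES_LABELS = {"nr": "number", "bl": "block", "sc": "staircase",
                   "et": "floor", "ap": "apartment"}

# Hypothetical, very small street-type lexicon (a real one is far larger).
STREET_TYPES = {"str": "Strada", "strada": "Strada",
                "bd": "Bulevardul", "bulevardul": "Bulevardul",
                "calea": "Calea", "aleea": "Aleea"}

# A premises label as a whole word, an optional full stop, then its value.
PREMISES_RE = re.compile(r"\b(nr|bl|sc|et|ap)\b\.?\s*([0-9A-Za-z]+)", re.IGNORECASE)


def parse_address(raw: str) -> dict:
    """Split a one-line address into street type, street name and premises parts."""
    fields = {}
    for label, value in PREMISES_RE.findall(raw):
        fields[PREMISES_LABELS[label.lower()]] = value

    # Everything before the first premises label is treated as the street part.
    first = PREMISES_RE.search(raw)
    street_part = (raw[:first.start()] if first else raw).strip(" ,")

    tokens = street_part.split()
    if tokens:
        key = tokens[0].rstrip(".").lower()
        if key in STREET_TYPES:
            fields["street_type"] = STREET_TYPES[key]
            tokens = tokens[1:]
    fields["street_name"] = " ".join(tokens)
    return fields


parsed = parse_address("Str. Vasile Milea Nr. 10, Bl. A3, Sc. 2, Et. 4, Ap. 15")
```

Each component ends up in its own field, which is what allows the later checks to work on one component at a time.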

Once the parsing is complete, then each individual address component is checked for spelling accuracy and to confirm that it is a valid address within that county (in the case of localities), or within that locality (in the case of streets).

Due to the huge variety of incorrect spellings and abbreviations, several techniques are used to identify the correct county, locality and street name: these include stochastic string matching and phonetic (or fuzzy) matching. There is an additional problem of missing (or incorrect) county names: these are identified by a process of 'reverse inference' from a knowledge of the locality and, in the case of ambiguities, the street address.
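As a rough illustration of fuzzy matching and reverse inference, the sketch below uses the similarity ratio from Python's standard `difflib` module as a stand-in for the matching techniques named above, which are not specified in detail. The reference table, the function names and the cutoff of 0.8 are assumptions, and the sample rows are illustrative only.

```python
import difflib
import unicodedata

# Illustrative sample only: locality name -> counties that contain a locality of that name.
LOCALITY_TO_COUNTIES = {
    "Braşov": ["Braşov"],
    "Iaşi": ["Iaşi"],
    "Gogoşu": ["Mehedinţi", "Dolj"],
}


def fold(text: str) -> str:
    """Lower-case the text and strip diacritics so that 'Brasov' compares equal to 'Braşov'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def match_locality(raw: str, cutoff: float = 0.8):
    """Return the reference locality closest to `raw`, or None if nothing is close enough."""
    folded = {fold(name): name for name in LOCALITY_TO_COUNTIES}
    hits = difflib.get_close_matches(fold(raw), list(folded), n=1, cutoff=cutoff)
    return folded[hits[0]] if hits else None


def infer_county(locality: str):
    """Reverse inference: return the county only when the locality name is unambiguous."""
    counties = LOCALITY_TO_COUNTIES.get(locality, [])
    return counties[0] if len(counties) == 1 else None
```

When `infer_county` returns None for a known locality, the name occurs in more than one county, and a further clue such as the street address is needed to decide between them.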

There is a particular issue with localities because there is a hierarchy of localities which, within SMARTaddress, are called –

  • Locality 1: County town (Iaşi, Braşov, etc.); commune.
  • Locality 2: Secondary town or village (satul).
  • Locality 3: Minor locality (zona, complexul, incinta, etc.).

This hierarchy is defined within the SIRUTA standard (Sistemul Informatic al Registrului Unităţilor Teritorial-Administrative) published by INS (Institutul Naţional de Statistică).

Once an address record has been cleansed, it is compared with a Reference Database (see below) to ensure that it conforms to a standard format and is fully up to date in terms of diacritics, name changes, etc.

Address standardisation

SMARTaddress has within it a Reference Database of over 120,000 counties, localities, place names, and street names. The localities include all the official names published by INS, but also other names which are in common use, e.g. complex, colonia, zona, etc. It also contains an extensive lexicon of street types and 'decorative' and/or 'courtesy' data such as Doctor, Profesor, Aviator, Episcop, etc.

Additionally, because of the very common use of abbreviations (e.g. Rm. Vâlcea, DT Severin, etc), there is a substantial look-up table of abbreviations and alternative forms. The Localities' database respects the SIRUTA hierarchy of localities published by INS. In certain cases, this results in three 'levels' of locality. For example:

  • Street: Stradă Principală
  • Locality 3: Porţile de Fier II
  • Locality 2: Balta Verde
  • Locality 1: Gogoşu
  • County: Jud. Mehedinţi
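One way to picture the three-level hierarchy is as a record with optional lower levels. The class below is a hypothetical sketch, not the Reference Database schema: the name `LocalityPath`, its fields and the comma-separated output are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LocalityPath:
    """A county plus up to three locality levels (field names are illustrative)."""
    county: str
    locality1: str
    locality2: Optional[str] = None
    locality3: Optional[str] = None

    def levels(self):
        """Localities from most specific to least specific, skipping empty levels."""
        return [x for x in (self.locality3, self.locality2, self.locality1) if x]

    def format(self, street: Optional[str] = None) -> str:
        """Join street, localities and county into one line."""
        parts = ([street] if street else []) + self.levels() + ["Jud. " + self.county]
        return ", ".join(parts)


example = LocalityPath("Mehedinţi", "Gogoşu", "Balta Verde", "Porţile de Fier II")
line = example.format("Stradă Principală")
```

Addresses with only one or two locality levels use the same record with the unused levels left empty.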

Street types and Street names are those defined by either Poşta Română or ROAEP (România Autoritatea Electorală Permanentă).

In SMARTaddress, street names are standardised according to the ROAEP methodology i.e. given name, followed by family name. For example:

Vasile Milea – as opposed to Milea Vasile, as defined by Poşta Română.

House numbers are always in the form:

Nr., Bl., Sc., Et., Ap. – but only when the data is available.

The Reference Database contains a table of over 75,000 equivalents between Premises numbers (Nr.) and Block numbers (Bl.) so, for example, if only a Block number (Bl.) is provided, SMARTaddress will also insert the correct Premises number (Nr.).
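The Block-to-Premises lookup can be sketched as a keyed table. The key structure (locality, street name, block), the function name and the single sample row are assumptions; the sample row is made up and does not describe a real address.

```python
# Made-up sample row; the real equivalents are held in the Reference Database.
BLOCK_TO_NUMBER = {
    ("Exemplu", "Strada Principală", "A3"): "10",
}


def fill_premises_number(address: dict) -> dict:
    """Return a copy of `address` with 'number' filled in from the block, when known."""
    result = dict(address)
    if not result.get("number") and result.get("block"):
        key = (result.get("locality"), result.get("street_name"), result["block"])
        number = BLOCK_TO_NUMBER.get(key)
        if number is not None:
            result["number"] = number
    return result


filled = fill_premises_number(
    {"locality": "Exemplu", "street_name": "Strada Principală", "block": "A3"}
)
```

An existing Premises number is never overwritten, and an unknown block leaves the record unchanged.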

Postcode attribution

Poşta Română issues a list of postcodes (INFOCOD) for approximately 25,000 streets and street segments defined by their numbers along the street. However, SMARTaddress contains details for over 100,000 streets and street segments: so, in order to assign the correct postcode to streets which exist but which are not in the INFOCOD database, Geo Strategies has created maps and assigned postcodes for the 75,000 streets not identified by Poşta Română. This process results in 6-digit postcodes for all Romania's streets and street segments, and also for those (mainly rural) localities or communes that have just a single postcode.
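A postcode lookup by street segment might look like the sketch below. The segment table, the odd/even split and all postcodes shown are made-up placeholders; how SMARTaddress actually stores segments is not described here, so treat the structure as an assumption.

```python
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    low: int        # lowest premises number in the segment
    high: int       # highest premises number in the segment
    parity: str     # "odd", "even" or "all" (assumed convention)
    postcode: str   # placeholder value, not a real postcode


# Placeholder data keyed by (locality, street).
SEGMENTS = {
    ("Exemplu", "Strada Principală"): [
        Segment(1, 49, "odd", "999001"),
        Segment(2, 50, "even", "999002"),
        Segment(51, 999, "all", "999003"),
    ],
}

# Localities that have just a single postcode (placeholder data).
LOCALITY_POSTCODES = {"Sat Exemplu": "999100"}


def _parity_matches(number: int, parity: str) -> bool:
    if parity == "all":
        return True
    return (number % 2 == 1) == (parity == "odd")


def assign_postcode(locality: str, street: Optional[str] = None,
                    number: Optional[int] = None) -> Optional[str]:
    """Try the street segment first, then fall back to a single-postcode locality."""
    if street is not None and number is not None:
        for seg in SEGMENTS.get((locality, street), []):
            if seg.low <= number <= seg.high and _parity_matches(number, seg.parity):
                return seg.postcode
    return LOCALITY_POSTCODES.get(locality)
```

The fallback mirrors the case of localities or communes that have just a single postcode.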

Geocoding

Geocoding is the process of identifying and attaching a set of coordinates to the address. The coordinate reference system is WGS-84, the same system used for GPS surveys, Google Earth, etc. Each cleansed address is allocated the most accurate coordinates possible; the accuracy achieved depends upon the accuracy of the address information provided. The accuracy codes are as follows:

  • ADD: geocoded to individual premises (± 30 metres)
  • SSC: geocoded to street segment centroids (± 75 metres)
  • SC: geocoded to street centroids (± 150 metres)
  • DC: geocoded to district (or cartier) centroids (± 1 km)
  • LC: geocoded to locality centroids (± 2 km)
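Choosing the most accurate level available can be sketched as a simple ordered fallback. The accuracy figures are those listed above; the function name, the input format and the coordinates in the example are assumptions.

```python
# Accuracy codes from most to least precise, with typical accuracy in metres.
ACCURACY_METRES = [("ADD", 30), ("SSC", 75), ("SC", 150), ("DC", 1000), ("LC", 2000)]


def best_geocode(candidates: dict):
    """Pick the most precise level present in `candidates` (code -> (lat, lon))."""
    for code, accuracy in ACCURACY_METRES:
        if code in candidates:
            lat, lon = candidates[code]
            return {"code": code, "accuracy_m": accuracy, "lat": lat, "lon": lon}
    return None


# Made-up coordinates: only the street and the locality could be resolved.
result = best_geocode({"LC": (44.50, 22.60), "SC": (44.51, 22.61)})
```

Here the street centroid (SC) is returned, because no premises or street-segment match was available.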

Notes:

  1. Individual centroids are snapped to the street centre-lines. So, in the case of a curved street, the centroid may well lie outside the street itself; however, after snapping, the coordinates will lie on the street (or street segment) centre-line.
  2. The accuracy figures are quoted as typical (or mean) accuracies in terms of the error between the actual coordinates and the coordinates provided. For example, in the case of a rural locality, the assumption is that the majority of the buildings for which there is an address lie within 2 km of the locality centroid.
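The snapping described in note 1 can be sketched as a projection of the centroid onto the nearest part of the centre-line. This is a simplified illustration: it treats coordinates as planar, takes the centroid as the mean of the centre-line vertices, and uses hypothetical function names.

```python
def centroid(points):
    """Mean of the vertices (a simplification of a true length-weighted centroid)."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def snap_to_polyline(point, line):
    """Return the point on the polyline `line` that is closest to `point`."""
    px, py = point
    best, best_d2 = None, float("inf")
    for (ax, ay), (bx, by) in zip(line, line[1:]):
        dx, dy = bx - ax, by - ay
        seg_len2 = dx * dx + dy * dy
        if seg_len2 == 0:
            t = 0.0
        else:
            # Position of the perpendicular foot along the segment, clamped to its ends.
            t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len2))
        qx, qy = ax + t * dx, ay + t * dy
        d2 = (px - qx) ** 2 + (py - qy) ** 2
        if d2 < best_d2:
            best, best_d2 = (qx, qy), d2
    return best


# An L-shaped street: the raw centroid falls off the centre-line.
street = [(0.0, 0.0), (10.0, 0.0), (10.0, 4.0)]
raw = centroid(street)
snapped = snap_to_polyline(raw, street)
```

For this street the raw centroid lies beside the first leg, and the snapped point lies on it.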

Output

The output from SMARTaddress can take a variety of forms:

  1. The raw (or original) data is always retained and output so that the cleansed and standardised addresses can be compared with the original information.
  2. The text can be output in the following forms:
    • Proper case or capitals.
      Note: for Proper case, names have a leading capital letter and all subsequent letters are lower case. Words such as 'de', 'şi', etc. are always lower case.
    • With diacritics or without diacritics.
      Note: the correct spelling and diacritics are always returned regardless of the input. For example, Tincabesti would be changed to Tâncăbeşti.

Within the database, all address components are held without any abbreviations. However, at the output, courtesy titles and similar (Profesor, Doctor, General, etc.) are output with standard abbreviations.
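The case and diacritics options can be illustrated with a short sketch. The function names and the list of lower-case words (only 'de' and 'şi' are taken from the text) are assumptions, and special cases such as Roman numerals are deliberately not handled.

```python
import unicodedata

# Words kept in lower case; extend as needed.
LOWERCASE_WORDS = {"de", "şi"}


def proper_case(text: str) -> str:
    """Leading capital on each name, with listed small words kept in lower case."""
    def cap(word: str) -> str:
        lower = word.lower()
        if lower in LOWERCASE_WORDS:
            return lower
        # Capitalise each part of a hyphenated name separately.
        return "-".join(part.capitalize() for part in lower.split("-"))
    return " ".join(cap(w) for w in text.split())


def strip_diacritics(text: str) -> str:
    """Remove combining marks, e.g. 'Tâncăbeşti' becomes 'Tancabesti'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def format_output(text: str, upper: bool = False, diacritics: bool = True) -> str:
    """Apply the chosen case and diacritics options to a cleansed name."""
    result = text if diacritics else strip_diacritics(text)
    return result.upper() if upper else proper_case(result)
```

Because the stored form always carries the correct diacritics, the version without diacritics is derived from it at output time rather than stored separately.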

A standard report is always created at the end of the address cleansing process. This contains a schedule of success rates for the individual components of the address matching process, together with a table of error codes that identify the various errors encountered within the original database.