awesome-architecture-mds/data-analytics/usaddress/Address_Pre_processing_Feature_Extraction.md at main · CodeBoarding/awesome-architecture-mds

graph LR
    tokenize["tokenize"]
    tokens2features["tokens2features"]
    tokenFeatures["tokenFeatures"]
    trailingZeros["trailingZeros"]
    digits["digits"]
    tokenize -- "The output (list of tokens) from `tokenize` serves as the direct input for the `tokens2features` component." --> tokens2features
    tokens2features -- "For each token in the input list, `tokens2features` invokes `tokenFeatures` to extract individual token characteristics." --> tokenFeatures
    tokenFeatures -- "Calls `trailingZeros` to determine if a token contains trailing zeros." --> trailingZeros
    tokenFeatures -- "Calls `digits` to analyze the numeric properties of a token." --> digits

Details

The usaddress library parses address strings through a pipeline that transforms raw input into structured, tagged address components. The process begins with the tokenize component, which breaks the address string into individual tokens. These tokens are passed to the tokens2features component, which builds a feature set for each token by calling the tokenFeatures component. The tokenFeatures component, in turn, uses the helper functions digits and trailingZeros to extract numeric and pattern-based characteristics. The resulting features are the input to a probabilistic model that classifies each token as an address element.

tokenize

This component is responsible for the initial transformation of a raw, unstructured address string into a list of discrete, ordered tokens. It acts as the entry point for the pre-processing pipeline, breaking down the input into manageable units using regular expressions.

Related Classes/Methods:

tokenize:731-749
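
To make the idea of regex-based tokenization concrete, here is a minimal sketch. It is not the library's implementation: the function name `sketch_tokenize` and the pattern are assumptions chosen for illustration, and the real tokenizer may treat punctuation and special characters differently.

```python
import re
from typing import List

# Hypothetical pattern: a token is any run of non-whitespace characters.
TOKEN_PATTERN = re.compile(r"\S+")


def sketch_tokenize(address: str) -> List[str]:
    """Split an address string into ordered tokens (illustrative only)."""
    return TOKEN_PATTERN.findall(address)


tokens = sketch_tokenize("123 Main St. Suite 100, Chicago IL")
print(tokens)
# ['123', 'Main', 'St.', 'Suite', '100,', 'Chicago', 'IL']
```

Note that punctuation stays attached to its token in this sketch (`'100,'`), so any cleanup would have to happen in a later feature-extraction step.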

tokens2features

This component takes the list of tokens generated by tokenize and transforms them into a comprehensive feature set. It orchestrates the feature extraction process for the entire sequence of tokens, preparing the data for the probabilistic model by adding 'previous' and 'next' token features.

Related Classes/Methods:

tokens2features:785-807
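
The 'previous' and 'next' context described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `sketch_token_features` is a hypothetical stand-in for the per-token extractor, and the dictionary layout is only one plausible way to attach neighbor features.

```python
from typing import Dict, List


def sketch_token_features(token: str) -> Dict[str, object]:
    """Hypothetical, deliberately tiny per-token feature extractor."""
    return {
        "lower": token.lower(),
        "has_digit": any(ch.isdigit() for ch in token),
    }


def sketch_tokens2features(tokens: List[str]) -> List[Dict[str, object]]:
    """Build one feature dict per token, adding its neighbors' features."""
    per_token = [sketch_token_features(t) for t in tokens]
    sequence = []
    for i, feats in enumerate(per_token):
        row = dict(feats)  # copy so neighbor dicts stay free of context keys
        if i > 0:
            row["previous"] = per_token[i - 1]
        if i < len(per_token) - 1:
            row["next"] = per_token[i + 1]
        sequence.append(row)
    return sequence


rows = sketch_tokens2features(["123", "Main", "St"])
print(rows[1]["previous"]["lower"], rows[1]["next"]["lower"])
# 123 st
```

The first token has no "previous" entry and the last has no "next" entry, which gives a sequence model a simple signal about position within the address.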

tokenFeatures

This component is responsible for analyzing a single token and extracting a rich set of features relevant for address parsing. These features capture various characteristics such as capitalization, numeric properties, and specific patterns, which are crucial for the probabilistic model's classification.

Related Classes/Methods:

tokenFeatures:755-782
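
As a rough illustration of the kinds of characteristics mentioned above (capitalization, numeric properties, patterns), the sketch below computes a few per-token features. The function name and the feature keys are assumptions made for this example; the library's actual feature set is not reproduced here.

```python
from typing import Dict


def sketch_single_token_features(token: str) -> Dict[str, object]:
    """Extract a few illustrative features from one token (hypothetical keys)."""
    stripped = token.strip(",;")  # drop separators that cling to the token
    return {
        "ends_with_period": stripped.endswith("."),  # hints at an abbreviation
        "is_capitalized": stripped[:1].isupper(),
        "is_all_caps": stripped.isupper(),
        "length": len(stripped),
        "has_digit": any(ch.isdigit() for ch in stripped),
        "has_hyphen": "-" in stripped,
    }


print(sketch_single_token_features("St."))
# {'ends_with_period': True, 'is_capitalized': True, 'is_all_caps': False, 'length': 3, 'has_digit': False, 'has_hyphen': False}
```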

trailingZeros

A utility component that inspects a given token (specifically a numeric one) for trailing zeros. This feature can help distinguish certain address components (e.g., "100" vs. "10").

Related Classes/Methods:

trailingZeros:820-825
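
One possible way to implement such a check is sketched below. It assumes the helper returns the run of trailing zeros itself rather than a boolean, which is what lets "100" and "10" produce different feature values; the name `sketch_trailing_zeros` and this return convention are assumptions for illustration.

```python
import re


def sketch_trailing_zeros(token: str) -> str:
    """Return the run of zeros at the end of the token, or '' if none."""
    match = re.search(r"0+$", token)
    return match.group(0) if match else ""


for value in ("100", "10", "101"):
    print(repr(value), "->", repr(sketch_trailing_zeros(value)))
# '100' -> '00'
# '10' -> '0'
# '101' -> ''
```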

digits

A utility component that analyzes the numeric composition of a token, identifying whether it contains digits and, potentially, other numeric characteristics. This helps classify tokens as numbers, postal codes, or other numeric address elements.

Related Classes/Methods:

digits:810-816
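
A simple way to express this kind of numeric analysis is a three-way category: all digits, some digits, or no digits. The sketch below follows that idea; the function name and the category labels are assumptions for illustration and may not match the library's own values.

```python
def sketch_digits(token: str) -> str:
    """Classify a token by how much of it is numeric (illustrative labels)."""
    if token.isdigit():
        return "all_digits"
    if any(ch.isdigit() for ch in token):
        return "some_digits"
    return "no_digits"


for value in ("60614", "4B", "Main"):
    print(value, "->", sketch_digits(value))
# 60614 -> all_digits
# 4B -> some_digits
# Main -> no_digits
```

A categorical value like this carries more information than a single boolean: a postal code and a unit designator such as "4B" both contain digits, but they fall into different categories.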
