graph LR
tokenize["tokenize"]
tokens2features["tokens2features"]
tokenFeatures["tokenFeatures"]
trailingZeros["trailingZeros"]
digits["digits"]
tokenize -- "The output (list of tokens) from `tokenize` serves as the direct input for the `tokens2features` component." --> tokens2features
tokens2features -- "For each token in the input list, `tokens2features` invokes `tokenFeatures` to extract individual token characteristics." --> tokenFeatures
tokenFeatures -- "Calls `trailingZeros` to determine if a token contains trailing zeros." --> trailingZeros
tokenFeatures -- "Calls `digits` to analyze the numeric properties of a token." --> digits
The usaddress library parses address strings through a pipeline that turns raw input into structured, tagged address components. The pipeline begins with `tokenize`, which splits the address string into individual tokens. The tokens are passed to `tokens2features`, which builds a feature set for each token by calling `tokenFeatures`. `tokenFeatures` in turn relies on the helper functions `digits` and `trailingZeros` to extract numeric and pattern-based characteristics. The resulting features are the input to a probabilistic model that classifies each token as a particular address element.
The `tokenize` component turns a raw, unstructured address string into an ordered list of discrete tokens. It is the entry point of the pre-processing pipeline and uses regular expressions to split the input into manageable units.
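As a rough illustration of regex-based tokenization, the sketch below splits an address on whitespace and separators while keeping trailing punctuation attached to its token. The pattern and the name `tokenize_sketch` are assumptions made for this example, not the pattern usaddress itself uses.

```python
import re
from typing import List

# Hypothetical, simplified pattern: a run of non-separator characters with any
# trailing punctuation kept attached, or a lone '#' or '&'.
TOKEN_PATTERN = re.compile(r"[^\s,;#&()]+[.,;)]*|[#&]")


def tokenize_sketch(address: str) -> List[str]:
    """Split an address string into ordered tokens (illustrative only)."""
    return TOKEN_PATTERN.findall(address)


tokens = tokenize_sketch("123 Main St., Apt #4")
print(tokens)
```

Keeping punctuation attached to a token, rather than discarding it, lets later feature extraction notice details such as an abbreviation's trailing period.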
The `tokens2features` component takes the list of tokens produced by `tokenize` and turns it into a sequence of feature sets, one per token. It orchestrates feature extraction for the whole sequence: after computing each token's own features, it attaches the features of the neighboring tokens under 'previous' and 'next', so the probabilistic model sees local context as well as the token itself.
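The neighbor-context idea can be sketched as follows. This is a minimal version under stated assumptions: `sequence_features_sketch` and `toy_features` are hypothetical names, and the per-token extractor is passed in as a parameter so the example stays self-contained.

```python
from typing import Callable, Dict, List


def sequence_features_sketch(
    tokens: List[str],
    token_features: Callable[[str], Dict[str, object]],
) -> List[Dict[str, object]]:
    """Attach each neighbor's features under 'previous' and 'next' keys."""
    # Compute the stand-alone features for every token once.
    per_token = [token_features(token) for token in tokens]

    sequence = []
    for i, own in enumerate(per_token):
        features = dict(own)  # copy so neighbor entries stay free of nesting
        if i > 0:
            features["previous"] = dict(per_token[i - 1])
        if i < len(per_token) - 1:
            features["next"] = dict(per_token[i + 1])
        sequence.append(features)
    return sequence


def toy_features(token: str) -> Dict[str, object]:
    # Hypothetical stand-in for a real per-token feature extractor.
    return {"length": len(token), "is_digit": token.isdigit()}


rows = sequence_features_sketch(["123", "Main", "St"], toy_features)
```

The first token has no 'previous' entry and the last has no 'next' entry, which implicitly marks the boundaries of the address.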
The `tokenFeatures` component analyzes a single token and extracts the features relevant to address parsing. These features capture characteristics such as capitalization, numeric properties, and specific patterns, which the probabilistic model uses to classify the token.
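A per-token extractor of this kind might look like the sketch below. The feature names and the function name `token_features_sketch` are invented for illustration and should not be read as the feature set usaddress actually emits.

```python
import string
from typing import Dict


def token_features_sketch(token: str) -> Dict[str, object]:
    """Describe one token with a few simple, illustrative features."""
    # Ignore surrounding punctuation when judging the token's core text.
    core = token.strip(string.punctuation)
    return {
        "is_capitalized": core[:1].isupper(),
        "is_all_digits": core.isdigit(),
        "has_digits": any(ch.isdigit() for ch in core),
        "length": len(core),
        "ends_in_punctuation": bool(token) and token[-1] in string.punctuation,
    }


street_suffix = token_features_sketch("St.,")
house_number = token_features_sketch("123")
```

Returning a flat dictionary keeps each feature independently addressable, which is the shape sequence-labeling toolkits commonly accept.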
The `trailingZeros` utility inspects a numeric token for trailing zeros. Because tokens such as "100" and "10" end in different runs of zeros, this feature can help distinguish certain address components.
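One plausible shape for such a helper is to return the run of zeros itself rather than a boolean, so that "100" and "10" produce different feature values. The sketch below assumes that design; `trailing_zeros_sketch` is a hypothetical name.

```python
import re


def trailing_zeros_sketch(token: str) -> str:
    """Return the run of zeros at the end of the token, or '' if none."""
    match = re.search(r"0+\Z", token)
    return match.group(0) if match else ""


print(trailing_zeros_sketch("100"), trailing_zeros_sketch("10"))
```

A string-valued feature preserves more information than a yes/no flag, at the cost of a slightly larger set of feature values.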
The `digits` utility analyzes the numeric composition of a token, reporting whether it consists entirely of digits, contains some digits, or contains none. This helps classify tokens as house numbers, postal codes, or other numeric address elements.
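A three-way classification of this kind can be sketched in a few lines. The function name `digits_sketch` and the label strings are assumptions chosen for readability.

```python
def digits_sketch(token: str) -> str:
    """Classify a token by how much of it is made of digits."""
    if token.isdigit():
        return "all_digits"
    if any(ch.isdigit() for ch in token):
        return "some_digits"
    return "no_digits"


labels = [digits_sketch(t) for t in ["60614", "4B", "Main"]]
```

The middle category matters for addresses, where mixed tokens such as unit numbers ("4B") behave differently from pure numbers and pure words.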