```mermaid
graph LR
    Address_Parsing_API["Address Parsing API"]
    Tokenizer["Tokenizer"]
    Feature_Extractor["Feature Extractor"]
    Probabilistic_Model["Probabilistic Model"]
    Address_Tagger_API["Address Tagger API"]
    Model_Management["Model Management"]
    Training_Data_Manager["Training Data Manager"]
    Configuration_Manager["Configuration Manager"]
    Address_Parsing_API -- "calls" --> Tokenizer
    Address_Parsing_API -- "calls" --> Feature_Extractor
    Address_Parsing_API -- "calls" --> Address_Tagger_API
    Tokenizer -- "provides tokens to" --> Address_Parsing_API
    Feature_Extractor -- "provides features to" --> Address_Parsing_API
    Address_Tagger_API -- "calls" --> Probabilistic_Model
    Address_Tagger_API -- "provides results to" --> Address_Parsing_API
    Model_Management -- "loads" --> Probabilistic_Model
    Model_Management -- "interacts with" --> Training_Data_Manager
    Training_Data_Manager -- "provides data to" --> Model_Management
    Configuration_Manager -- "provides settings to" --> Model_Management
    Configuration_Manager -- "provides settings to" --> Address_Parsing_API
    click Address_Parsing_API href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/usaddress/Address_Parsing_API.md" "Details"
```


Details

The usaddress library parses raw address strings through a pipeline of cooperating components. The Address Parsing API is the central entry point and orchestrates the process. It first uses the Tokenizer to split a raw address string into individual tokens. The Feature Extractor then turns those tokens into a set of features. The Address Tagger API passes these features to the Probabilistic Model (a CRF model), which assigns an address component tag to each token. Model Management handles the loading and lifecycle of the Probabilistic Model, relying on the Training Data Manager for labeled data and the Configuration Manager for operational settings. This modular design separates concerns cleanly, which makes the library easier to maintain and extend.

Address Parsing API

The primary public interface and orchestrator of the usaddress library. It receives raw address strings, coordinates the entire parsing workflow (including tokenization, feature extraction, and subsequent tagging), and returns structured address data. This component embodies both the "Input/Output Interface" and the "Parsing Engine Orchestrator" concepts.
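The orchestration described above can be sketched as a function that wires three stages together. This is a minimal illustration and not the library's code: `parse_address` and the three `toy_*` stand-ins are hypothetical names, and the toy tagger uses a trivial rule in place of a trained model.

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, object]


def parse_address(
    raw: str,
    tokenize: Callable[[str], List[str]],
    extract: Callable[[List[str]], List[Features]],
    tag: Callable[[List[Features]], List[str]],
) -> List[Tuple[str, str]]:
    """Run tokenization, feature extraction, and tagging in sequence."""
    tokens = tokenize(raw)
    if not tokens:
        return []
    features = extract(tokens)
    labels = tag(features)
    if len(labels) != len(tokens):
        raise ValueError("tagger must return exactly one label per token")
    return list(zip(tokens, labels))


# Hypothetical stand-ins for the real components, for illustration only.
def toy_tokenize(raw: str) -> List[str]:
    return raw.split()


def toy_extract(tokens: List[str]) -> List[Features]:
    return [{"token": t, "is_digit": t.isdigit()} for t in tokens]


def toy_tag(features: List[Features]) -> List[str]:
    return ["Number" if f["is_digit"] else "Word" for f in features]


result = parse_address("123 Main St", toy_tokenize, toy_extract, toy_tag)
empty = parse_address("", toy_tokenize, toy_extract, toy_tag)
```

Passing the stages in as callables keeps the orchestrator independent of any one tokenizer or model, which mirrors the separation of concerns in the diagram.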

Related Classes/Methods:

Tokenizer

Responsible for breaking down raw input address strings into individual tokens (e.g., words, numbers, punctuation). This is the first step in the parsing pipeline, preparing the input for feature extraction.
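As a rough illustration of this step, the sketch below splits an address into word, number, and punctuation tokens with a single regular expression. The pattern and the `tokenize` name are assumptions chosen for this example; the library's actual tokenization rules may differ.

```python
import re
from typing import List

# Assumed pattern: a run of word characters, or any single character
# that is neither a word character nor whitespace (i.e. punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(address: str) -> List[str]:
    """Split a raw address string into word, number, and punctuation tokens."""
    return TOKEN_PATTERN.findall(address)


tokens = tokenize("123 Main St., Chicago, IL")
```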

Related Classes/Methods:

Feature Extractor

Transforms the tokens generated by the Tokenizer into a rich set of features (e.g., word shape, capitalization, contextual information) that the Probabilistic Model can effectively use for accurate address component tagging.
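To make the feature types named above concrete, here is a small sketch that computes word shape, capitalization, and neighbouring-token context for each token. The feature names and the `<BOS>`/`<EOS>` boundary markers are hypothetical choices for this example, not the library's feature set.

```python
from typing import Dict, List


def word_shape(token: str) -> str:
    """Map digits to 'd', uppercase to 'X', lowercase to 'x'; keep other characters."""
    shape = []
    for ch in token:
        if ch.isdigit():
            shape.append("d")
        elif ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        else:
            shape.append(ch)
    return "".join(shape)


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Build a feature dict for the token at position i, including its neighbours."""
    token = tokens[i]
    return {
        "lower": token.lower(),
        "shape": word_shape(token),
        "is_capitalized": token[:1].isupper(),
        "is_digit": token.isdigit(),
        # Contextual features: neighbouring tokens, with boundary markers.
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def extract_features(tokens: List[str]) -> List[Dict[str, object]]:
    return [token_features(tokens, i) for i in range(len(tokens))]


feats = extract_features(["123", "Main", "St"])
```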

Related Classes/Methods:

Probabilistic Model

This component wraps the trained sequence model, likely implemented using python-crfsuite. It applies a trained Conditional Random Field (CRF) model to the extracted features, assigning an address component tag (e.g., StreetName, PlaceName, StateName) to each token. This is where the actual labeling decisions of the parsing process are made.
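A linear-chain CRF picks the label sequence with the highest total score, combining per-token (emission) scores with label-to-label (transition) scores. The sketch below shows that decoding step with Viterbi search over hand-written scores. It is an illustration only: a real CRF learns its weights from training data, a CRF library performs this decoding internally, and every number here is made up.

```python
from typing import Dict, List


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    labels: List[str],
) -> List[str]:
    """Return the highest-scoring label sequence; missing scores count as 0."""
    if not emissions:
        return []
    # best[t][y] is the best score of any path that ends in label y at position t.
    best = [{y: emissions[0].get(y, 0.0) for y in labels}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in labels:
            prev = max(
                labels,
                key=lambda p: best[-1][p] + transitions.get(p, {}).get(y, 0.0),
            )
            scores[y] = (
                best[-1][prev]
                + transitions.get(prev, {}).get(y, 0.0)
                + emissions[t].get(y, 0.0)
            )
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Made-up scores for the tokens "123", "Main", "St".
labels = ["AddressNumber", "StreetName", "StreetNamePostType"]
emissions = [
    {"AddressNumber": 2.0, "StreetName": 0.1},
    {"StreetName": 1.0, "StreetNamePostType": 0.8},
    {"StreetName": 0.9, "StreetNamePostType": 1.0},
]
transitions = {
    "AddressNumber": {"StreetName": 1.0},
    "StreetName": {"StreetName": 0.2, "StreetNamePostType": 1.0},
}
path = viterbi(emissions, transitions, labels)
```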

Related Classes/Methods:

Address Tagger API

A specialized interface or internal component focused directly on the tagging process. It takes prepared features and interacts with the Probabilistic Model to perform the actual tagging of address components. It can be called by the Address Parsing API or potentially used for more granular control.
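One way a tagger-level interface might present its results is to merge consecutive tokens that share a tag into a single component. The sketch below assumes that behaviour for illustration; `group_by_label` is a hypothetical helper, and rejecting a label that reappears after a gap is a design choice of this example.

```python
from typing import Dict, List, Tuple


def group_by_label(tagged: List[Tuple[str, str]]) -> Dict[str, str]:
    """Merge consecutive tokens that share a label into one component string."""
    components: Dict[str, str] = {}
    previous = None
    for token, label in tagged:
        if label == previous:
            components[label] += " " + token
        elif label in components:
            # The same label in two separate runs is ambiguous; refuse to guess.
            raise ValueError(f"label {label!r} appears in two separate runs")
        else:
            components[label] = token
        previous = label
    return components


grouped = group_by_label(
    [
        ("123", "AddressNumber"),
        ("Martin", "StreetName"),
        ("Luther", "StreetName"),
        ("Blvd", "StreetNamePostType"),
    ]
)
```

Because Python dictionaries keep insertion order, the components come back in the order they appear in the address.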

Related Classes/Methods:

Model Management

Responsible for loading, saving, and managing the lifecycle of the Probabilistic Model. This component ensures that the correct and most up-to-date model is available for the parsing and tagging operations, and facilitates model training and updates.
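The lifecycle handling described here can be sketched as a small wrapper that loads a model file on first use, caches it, and reloads it on request. `LazyModel` and `read_bytes_loader` are hypothetical names; the demo reads raw bytes from a temporary file in place of a real model format.

```python
import tempfile
from pathlib import Path
from typing import Callable, Generic, List, Optional, TypeVar

M = TypeVar("M")


class LazyModel(Generic[M]):
    """Load a model file on first access and cache the result."""

    def __init__(self, path: Path, loader: Callable[[Path], M]) -> None:
        self._path = Path(path)
        self._loader = loader
        self._model: Optional[M] = None

    def get(self) -> M:
        if self._model is None:
            if not self._path.is_file():
                raise FileNotFoundError(f"model file not found: {self._path}")
            self._model = self._loader(self._path)
        return self._model

    def reload(self) -> M:
        """Drop the cached model and load the file again, e.g. after retraining."""
        self._model = None
        return self.get()


load_calls: List[Path] = []


def read_bytes_loader(path: Path) -> bytes:
    # Stand-in loader: records each call and returns the raw file contents.
    load_calls.append(path)
    return path.read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "toy.model"
    model_file.write_bytes(b"weights-v1")
    manager = LazyModel(model_file, read_bytes_loader)
    first = manager.get()
    second = manager.get()  # served from the cache, no second load
    model_file.write_bytes(b"weights-v2")
    third = manager.reload()
```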

Related Classes/Methods:

Training Data Manager

Manages the collection, preparation, and loading of the labeled training data used to train and improve the Probabilistic Model. This component is fundamental to the "Data-Driven" aspect of the library, enabling continuous model improvement.
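As an illustration of loading labeled data, the sketch below reads token/label pairs from an XML document in which each address is an element whose children are named after their labels. The XML layout, the element names, and `load_labeled_sequences` are assumptions for this example and should be checked against the project's actual training files.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Assumed layout: one <AddressString> per address, one child element per token.
SAMPLE = """<AddressCollection>
  <AddressString><AddressNumber>123</AddressNumber> <StreetName>Main</StreetName> <StreetNamePostType>St</StreetNamePostType></AddressString>
  <AddressString><PlaceName>Chicago</PlaceName> <StateName>IL</StateName></AddressString>
</AddressCollection>"""


def load_labeled_sequences(xml_text: str) -> List[List[Tuple[str, str]]]:
    """Return one list of (token, label) pairs per labeled address."""
    root = ET.fromstring(xml_text)
    sequences = []
    for address in root.iter("AddressString"):
        sequences.append([(child.text or "", child.tag) for child in address])
    return sequences


sequences = load_labeled_sequences(SAMPLE)
```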

Related Classes/Methods:

Configuration Manager

Handles the loading and management of various configuration settings for the entire library, such as paths to model files, default parsing behaviors, and other operational parameters. It ensures the library operates according to specified settings.
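A configuration component of this kind could be sketched as an immutable settings object with defaults and validated overrides. The setting names (`model_path`, `strip_punctuation`) and their default values are hypothetical and exist only to show the pattern.

```python
from dataclasses import dataclass, fields
from pathlib import Path
from typing import Any, Mapping


@dataclass(frozen=True)
class ParserConfig:
    # Hypothetical settings and defaults, for illustration only.
    model_path: Path = Path("model.crfsuite")
    strip_punctuation: bool = True


def load_config(overrides: Mapping[str, Any]) -> ParserConfig:
    """Build a config from defaults plus overrides, rejecting unknown keys."""
    known = {f.name for f in fields(ParserConfig)}
    unknown = set(overrides) - known
    if unknown:
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    values = dict(overrides)
    if "model_path" in values:
        values["model_path"] = Path(values["model_path"])
    return ParserConfig(**values)


config = load_config({"model_path": "custom.crfsuite"})
```

Rejecting unknown keys surfaces typos in a settings file early, so they do not silently fall back to defaults.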

Related Classes/Methods: