ANA datasets

Updated: 2026-06-10

This documentation describes how the registration data of hydrological stations made available by the National Water and Basic Sanitation Agency (ANA) are obtained, processed, and structured. The information is accessed directly through the public API of the ServiceANA system, via its HidroInventário section. The algorithm sends HTTP GET requests, parses the XML response, standardizes fields, corrects inconsistencies, and organizes the data into structured files, which keeps the processing of both streamflow stations (stage/discharge) and rainfall stations (precipitation) repeatable and consistent.

To download the detailed information for each station, it was first necessary to identify all stations registered in the system, since the individual code of each station is mandatory in the requests. Therefore, before collecting the specific data for the two station types considered, the script performs an initial scan of the complete inventory through automated requests, applying techniques for extraction and programmatic reading of the content returned by the API. This preliminary survey underpins the entire workflow for obtaining the registration data.

List of stations registered by the ANA 

The ANA system contains records of different categories of hydrometeorological stations, including Telemetric Stations — automatic stations with real-time remote data transmission — and Conventional Stations, which rely on manual measurements performed by observers and hydrology technicians. To identify all available stations, the survey was based on the complete listing provided by the HidroInventário [1] [2], which includes both conventional and telemetric stations and centralizes the registration information required for data processing.

In addition to providing the complete list of stations, the HidroInventário supplies attributes from which the start date of operation of each station can be calculated. This date is essential for determining which stations have a minimum operational history; only those stations have their data organized and prepared for subsequent download attempts. This pre-processing step ensures that only stations with sufficient operational time enter the data collection workflow.

Note

Only stations with a start of operation on or before 01/01/2024 were considered, ensuring that each station has at least one potential year of complete data, taking 31/12/2024 as the final reference date.

Syntax and requests 

Access to the station inventory is performed through HTTP GET requests sent to the HidroInventario endpoint. Each request accepts parameters that act as optional filters: when a parameter is left blank, the service returns all compatible stations. The script automates the construction of these URLs, handles retries in case of failures, applies pauses to avoid server overload, and can perform both nationwide queries and queries by state (UF). The main accepted parameters are:

codEstDE: Initial code of the station range (8 digits).
codEstATE: Final code of the station range (8 digits).
tpEst: Station type (1 = streamflow; 2 = rainfall).
nmEst: Station name.
nmRio: Name of the monitored river.
codSubBacia: Sub-basin code.
codBacia: Basin code.
nmMunicipio: Municipality name.
nmEstado: State name.
sgResp: Acronym of the responsible agency.
sgOper: Acronym of the operating agency.
telemetrica: Indicates active telemetry (1 = yes; 0 = no).

These filters form the basis for reconstructing the national inventory with accuracy and flexibility.

To ensure query stability, the algorithm performs requests by state (UF), using the nmEstado parameter written in full and with the exact spelling registered by ANA (for example, BAHIA). This information is combined with the desired station type (tpEst = 1 or 2), so that the service returns all stations of that type within the requested state. This strategy avoids freezes and empty responses caused by API inconsistencies and allows the complete inventory to be collected in a safe and systematic manner.
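The per-state query described above can be sketched as follows. This is a minimal illustration rather than the script's actual code: the endpoint address in `BASE_URL` and the function name `build_inventory_url` are assumptions not stated in this documentation, and should be checked against the current ServiceANA reference.

```python
from urllib.parse import urlencode

# Assumed endpoint address (not given in this documentation); verify before use.
BASE_URL = "http://telemetriaws1.ana.gov.br/ServiceANA.asmx/HidroInventario"

# Parameter names as listed above. All are sent; blank values mean "no filter".
PARAMS = ("codEstDE", "codEstATE", "tpEst", "nmEst", "nmRio", "codSubBacia",
          "codBacia", "nmMunicipio", "nmEstado", "sgResp", "sgOper", "telemetrica")


def build_inventory_url(tp_est, nm_estado="", base_url=BASE_URL):
    """Return a HidroInventario GET URL filtered by station type and state."""
    if str(tp_est) not in {"1", "2"}:
        raise ValueError("tpEst must be 1 (streamflow) or 2 (rainfall)")
    query = {name: "" for name in PARAMS}  # start with every filter blank
    query["tpEst"] = str(tp_est)
    query["nmEstado"] = nm_estado          # full state name, e.g. "BAHIA"
    return f"{base_url}?{urlencode(query)}"


url = build_inventory_url(2, "BAHIA")
```

Sending every parameter, including the blank ones, mirrors the behaviour described above, in which unused filters are left empty instead of being omitted.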

Configuration and auxiliary functions 

Before data collection, the script defines the official ANA endpoint, along with a set of auxiliary functions responsible for handling the returned content. These include routines to extract values from XML tags, convert numbers received as text, parse dates in different formats, and correct coordinates that are returned without a decimal point or that present values outside the limits of Brazilian territory. These functions form the foundation of the pipeline, allowing the data to be processed in a stable and automated manner regardless of variations in the response formats provided by the API.
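A minimal sketch of such helpers is shown below. The function names, the list of date formats, and the rescaling heuristic for coordinates are all assumptions made for illustration; the actual script may use different rules.

```python
from datetime import datetime

# Assumed set of formats; extend it if the API returns other layouts.
DATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%d/%m/%Y %H:%M:%S", "%d/%m/%Y")


def to_float(text):
    """Convert numeric text (comma or point decimal) to float; None if unparseable."""
    if text is None or not str(text).strip():
        return None
    try:
        return float(str(text).strip().replace(",", "."))
    except ValueError:
        return None


def parse_date(text):
    """Try each known format in turn; return None when none matches."""
    if text is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None


def fix_coordinate(text, limit):
    """Divide by 10 until |value| <= limit (heuristic for a lost decimal point)."""
    value = to_float(text)
    if value is None:
        return None
    while abs(value) > limit:
        value /= 10.0
    return value
```

The rescaling heuristic is ambiguous for coordinates whose integer part has a single digit, so values corrected this way are best kept alongside a flag for later review.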

The inventory registration fields for ANA stations (Types 1 and 2) consist of a fixed set of fields representing operational metadata, including location (state, municipality, watershed), operator attributes, types of instruments installed, installation and removal dates, operational status, and other related information. The script maintains a single, consolidated list of all these field names, ensuring that each station is registered using the same structure and that different queries return mutually compatible tables.

Note

The data dictionary for the variables present in the complete ANA registration listing will be made available at a later stage, after confirmation with ANA regarding the description of certain variables for which no documentation could be identified.

Candidate start and end dates of operation 

The returned data include multiple columns indicating the start and end times of operation for specific instruments (rain gauges, water level recorders, telemetric sensors, among others). Since these components may be installed at different times, the script identifies, for each station type, which columns should be taken into account. It then computes the earliest available date as the station start date and the latest available date as the end date. This procedure standardizes the station’s operational period even when ANA does not provide this information in a consolidated form.

To derive the start and end dates of operation for each station type, the following variables were used:

Streamflow stations (Type 1)		Rainfall stations (Type 2)
Start dates	End dates	Start dates	End dates
`PeriodoEscalaInicio`	`PeriodoEscalaFim`	`PeriodoPluviometroInicio`	`PeriodoPluviometroFim`
`PeriodoRegistradorNivelInicio`	`PeriodoRegistradorNivelFim`	`PeriodoRegistradorChuvaInicio`	`PeriodoRegistradorChuvaFim`
`PeriodoDescLiquidaInicio`	`PeriodoDescLiquidaFim`	`PeriodoTelemetricaInicio`	`PeriodoTelemetricaFim`
`PeriodoQualAguaInicio`	`PeriodoQualAguaFim`	`PeriodoClimatologicaInicio`	`PeriodoClimatologicaFim`
`PeriodoTelemetricaInicio`	`PeriodoTelemetricaFim`

The variables listed above were used to derive two standardised temporal attributes for each station: DataInicioOperacao and DataFimOperacao. For each station type, the earliest start date across all available instrument period fields was taken as DataInicioOperacao, while the latest end date was taken as DataFimOperacao. The derivation procedure is described in detail in the section Start and end dates of operation.
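The derivation can be sketched with pandas as a row-wise minimum and maximum over the candidate columns. The function name `add_operation_dates` is hypothetical, and the sketch assumes the candidate columns have already been converted to datetime values.

```python
import pandas as pd

# Candidate start columns per station type, taken from the table above.
START_COLS = {
    "1": ["PeriodoEscalaInicio", "PeriodoRegistradorNivelInicio",
          "PeriodoDescLiquidaInicio", "PeriodoQualAguaInicio",
          "PeriodoTelemetricaInicio"],
    "2": ["PeriodoPluviometroInicio", "PeriodoRegistradorChuvaInicio",
          "PeriodoTelemetricaInicio", "PeriodoClimatologicaInicio"],
}


def add_operation_dates(df, tipo):
    """Add DataInicioOperacao (row-wise min) and DataFimOperacao (row-wise max)."""
    start_cols = [c for c in START_COLS[tipo] if c in df.columns]
    # Each end column pairs with a start column: "...Inicio" -> "...Fim".
    end_cols = [c[:-len("Inicio")] + "Fim" for c in start_cols]
    end_cols = [c for c in end_cols if c in df.columns]
    out = df.copy()
    out["DataInicioOperacao"] = out[start_cols].min(axis=1)  # missing dates are skipped
    out["DataFimOperacao"] = out[end_cols].max(axis=1)
    return out


example = pd.DataFrame({
    "PeriodoPluviometroInicio": pd.to_datetime(["1970-05-01", None]),
    "PeriodoTelemetricaInicio": pd.to_datetime(["2015-03-10", "2018-01-01"]),
    "PeriodoPluviometroFim": pd.to_datetime([None, None]),
    "PeriodoTelemetricaFim": pd.to_datetime(["2020-12-31", None]),
})
result = add_operation_dates(example, "2")
```

In this made-up example the first station starts with its rain gauge and ends with its telemetry record, while the second has a start date but no end date.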

Inventory request and parsing 

The collection of registration data begins by building an HTTP GET request to the HidroInventario endpoint. The function first validates the requested station type and then constructs the full URL, including all parameters supported by the service. Only the required parameters (such as tpEst and nmEstado) are populated, while the remaining ones are left blank to avoid unintended filtering. The request also includes a user identifier and automatic retry mechanisms with an increasing backoff interval, ensuring stability even in the presence of temporary failures on the ANA server.
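The retry behaviour can be illustrated with a small wrapper that is independent of the HTTP client. The names `get_with_retry`, `retries`, and `base_wait` are hypothetical, and the linear growth of the pause is an assumption, since the exact backoff rule is not specified here.

```python
import time


def get_with_retry(fetch, url, retries=3, base_wait=1.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds, pausing longer after each failure."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # broad on purpose: the client library is not fixed
            last_error = exc
            if attempt < retries:
                sleep(base_wait * attempt)  # pauses of 1x, 2x, 3x ... base_wait
    raise RuntimeError(f"request failed after {retries} attempts") from last_error


# Demonstration with a fake fetcher that fails twice before answering.
calls = {"n": 0}
waits = []


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "<xml/>"


body = get_with_retry(flaky_fetch, "http://example.invalid", sleep=waits.append)
```

In real use, `fetch` would wrap an HTTP client call that sets a user identifier header and a timeout; passing `sleep` as a parameter keeps the wrapper testable without actual waiting.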

After submitting the request, the function checks the HTTP status code and, if the response is not successful, logs the error, optionally applies a pause, and returns an empty table. When the response is valid, the XML content is parsed, namespaces are removed, and the function searches for an <Error> node; if a specific error message is present, it is displayed and the processing for that state (UF) is safely interrupted. Next, the function identifies all <Table> nodes—each representing a station—and converts their fields into a tabular structure using the fixed list of attributes provided by the inventory. Finally, the received records are consolidated into a raw table and a tipo_estacao column is added to indicate the queried station type, preparing the dataset for the next cleaning and standardization steps.
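The parsing steps above can be sketched with the standard library. The sample XML, the station values in it, and the short `FIELDS` list are invented for illustration; the real response carries the full fixed list of inventory attributes.

```python
import xml.etree.ElementTree as ET

FIELDS = ["Codigo", "Nome", "Latitude", "Longitude"]  # illustrative subset only


def strip_namespaces(root):
    """Remove the '{uri}' prefix that ElementTree adds to namespaced tags."""
    for elem in root.iter():
        if isinstance(elem.tag, str) and "}" in elem.tag:
            elem.tag = elem.tag.split("}", 1)[1]
    return root


def parse_inventory(xml_text, tipo, fields=FIELDS):
    """Return one dict per <Table> node, or an empty list on an <Error> node."""
    root = strip_namespaces(ET.fromstring(xml_text))
    error = root.find(".//Error")
    if error is not None and (error.text or "").strip():
        print(f"API error: {error.text.strip()}")
        return []
    rows = []
    for table in root.iter("Table"):  # each <Table> node is one station
        row = {name: (table.findtext(name) or "").strip() for name in fields}
        row["tipo_estacao"] = tipo
        rows.append(row)
    return rows


SAMPLE = """<DataTable xmlns="http://MRCS/">
  <Table><Codigo>12345678</Codigo><Nome>EXEMPLO</Nome>
    <Latitude>-12.97</Latitude><Longitude>-38.51</Longitude></Table>
</DataTable>"""

rows = parse_inventory(SAMPLE, "2")
```

Fields absent from a station node come back as empty strings, so every row has the same keys and the rows can be stacked into one table.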

Data cleaning and type casting 

After extracting the station registration data, the script applies a standardization step to ensure consistency and usability of the dataset. Initially, geographic coordinates are processed: latitude and longitude values received without a decimal point are corrected and then converted to numeric values. In parallel, fields representing operational codes or indicators are cast to integers, ensuring that such information does not remain stored as text.

Next, all columns related to operational periods (those starting with Periodo and ending with Inicio or Fim) are converted to date format, while administrative columns such as UltimaAtualizacao, DataIns, and DataAlt are converted to date-time values. Finally, textual fields undergo a cleaning process that removes leading and trailing whitespace and replaces empty strings with missing values. This structured routine ensures that all data types are coherent and that the resulting table is suitable for subsequent processes, such as date calculations, deduplication, and geographic validation.
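A possible pandas version of the text and date handling is sketched below. The function name `clean_inventory` is hypothetical, and the sketch assumes that every column of the raw table still holds text, as it would straight after XML parsing; coordinate and integer casting are left out for brevity.

```python
import pandas as pd

ADMIN_COLS = ("UltimaAtualizacao", "DataIns", "DataAlt")


def clean_inventory(raw):
    """Tidy text, then cast period columns to dates and admin columns to datetimes."""
    df = raw.copy()
    for col in df.columns:
        # Trim whitespace and turn empty strings into missing values.
        df[col] = df[col].str.strip().mask(lambda s: s == "")
    period_cols = [c for c in df.columns
                   if c.startswith("Periodo") and c.endswith(("Inicio", "Fim"))]
    for col in period_cols:
        # Unparseable text becomes NaT; normalize() drops the time of day.
        df[col] = pd.to_datetime(df[col], errors="coerce").dt.normalize()
    for col in ADMIN_COLS:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors="coerce")
    return df


raw = pd.DataFrame({
    "Nome": ["  RIO X ", ""],
    "PeriodoEscalaInicio": ["1961-01-01 00:00:00", ""],
    "DataAlt": ["2020-05-04 10:30:00", "bad"],
})
clean = clean_inventory(raw)
```

Using `errors="coerce"` means a malformed date produces a missing value instead of stopping the run, which matches the goal of a stable automated pipeline.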

Start and end dates of operation 

After standardizing the date columns, the script consolidates the operational period of each station through two distinct routines. The function responsible for determining the start of operation identifies, for each station type, the set of potentially relevant columns (for example, start dates of rain gauges, water level recorders, or telemetry systems) and computes the earliest valid date among them, assigning the result to the DataInicioOperacao column. This procedure ensures that the station start date represents the earliest moment at which any associated instrument began recording data.

Complementarily, the end-of-operation calculation uses the corresponding candidate columns related to instrument decommissioning and selects the latest available date for each station, resulting in the DataFimOperacao column. This process is performed separately for each station type, respecting its specific set of fields. The outcome is two consolidated columns that describe the historical operational interval of the station, serving as a reference for temporal filtering and for analyses that depend on the duration or continuity of measurements.

This procedure is primarily intended to support the data download process by ensuring that only stations operating within the climatological normal period of interest are considered. However, cases were observed in which stations present a computable value for the DataFimOperacao variable while the Operando variable indicates a value of 1, characterizing an inconsistency in the registration data. Therefore, only the DataInicioOperacao column is used as support for filtering stations that began operation within each climatological normal period when performing data downloads.

Station-level deduplication 

The ANA registration inventory may include multiple records for the same station, usually resulting from administrative revisions or updates performed at different points in time. To ensure that only the most up-to-date registration is retained, the script applies an ordering procedure based on administrative dates provided by ANA itself. Records are first grouped by station code and then prioritized according to the presence and recency of DataAlt, DataIns, and UltimaAtualizacao, always favoring records in which these fields are populated and, among them, the most recent ones.

After this ordering step, the algorithm selects only the first record for each station code, ensuring that each station is represented by a single consolidated entry. This process removes duplicates, prevents inconsistencies arising from outdated registration versions, and produces a coherent final dataset suitable for subsequent analyses and for integration with other components of the inventory.
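One way to express this ordering in pandas is sketched here. The station code column name `Codigo` and the function name are assumptions, and the sketch reads the priority rule as a lexicographic sort on DataAlt, then DataIns, then UltimaAtualizacao, with missing values last.

```python
import pandas as pd

ORDER = ["DataAlt", "DataIns", "UltimaAtualizacao"]


def deduplicate_stations(df, code_col="Codigo"):
    """Keep, for each station code, the record with the most recent admin dates."""
    ranked = df.sort_values(ORDER, ascending=False, na_position="last")
    kept = ranked.drop_duplicates(subset=code_col, keep="first")
    return kept.sort_values(code_col).reset_index(drop=True)


records = pd.DataFrame({
    "Codigo": [1, 1, 2],
    "Nome": ["old", "new", "single"],
    "DataAlt": pd.to_datetime([None, "2021-03-01", None]),
    "DataIns": pd.to_datetime(["2019-01-01", "2018-01-01", "2000-01-01"]),
    "UltimaAtualizacao": pd.to_datetime([None, None, None]),
})
unique = deduplicate_stations(records)
```

In the made-up data, station 1 keeps the record with a populated DataAlt even though the other record has a later DataIns, because DataAlt takes precedence.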

Geographic validation 

Geographic validation ensures that the registered coordinates of the stations are consistent with the actual boundaries of Brazilian territory and with plausible elevation values. The implemented procedure identifies coordinates that fall outside the expected range, corrects straightforward cases, and flags situations in which the station location cannot be determined reliably. In addition, the script creates indicator variables that allow quick identification of stations with potential spatial inconsistencies. The following subsections detail the adopted criteria.

Altitude validation 

The altitude reported for some stations may present values incompatible with real-world conditions, either due to data entry errors or registration issues. The algorithm applies three main rules:

Slightly negative altitudes (between –10 and 0 meters) are adjusted to 0, as they may represent minor measurement inaccuracies.
Altitudes lower than –10 meters or higher than 3000 meters—above the highest point in Brazil—are considered invalid and replaced with NA.
Valid altitudes remain unchanged.

These rules allow common registration errors to be corrected without discarding potentially useful information.
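The three rules translate into a short function. This sketch uses `None` where the text above says NA, and the function name is hypothetical.

```python
def validate_altitude(alt):
    """Apply the altitude rules: clamp slight negatives, drop implausible values."""
    if alt is None or alt != alt:  # None or NaN stays missing
        return None
    if -10 <= alt < 0:
        return 0.0                 # minor measurement inaccuracy
    if alt < -10 or alt > 3000:
        return None                # outside the plausible range for Brazil
    return float(alt)


checked = [validate_altitude(a) for a in (-3, -10.5, 850, 3001)]
```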

Latitude and longitude validation 

To verify whether a station is located within the geographic boundaries of Brazil, the script compares its coordinates against the known minimum and maximum ranges: approximately between –33.75° and 5.27° latitude, and between –74.00° and –28.83° longitude. If a coordinate falls outside this range, the station is flagged as having a suspicious location. This assessment generates two additional columns, lat_fora and lon_fora, which take the value 1 when the coordinate is outside the defined limits and 0 otherwise. These variables support dataset auditing and enable subsequent analyses of potential spatial inconsistencies.
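The range check can be sketched as below, using the limits quoted above. The function name is hypothetical, and the sketch assumes both coordinates are already numeric; how missing coordinates are flagged is not specified here.

```python
LAT_MIN, LAT_MAX = -33.75, 5.27
LON_MIN, LON_MAX = -74.00, -28.83


def flag_outside(lat, lon):
    """Return (lat_fora, lon_fora): 1 when the value is outside the range, else 0."""
    lat_fora = int(not (LAT_MIN <= lat <= LAT_MAX))
    lon_fora = int(not (LON_MIN <= lon <= LON_MAX))
    return lat_fora, lon_fora


inside = flag_outside(-12.97, -38.51)
wrong_sign = flag_outside(12.97, -38.51)
```

A bounding range is only a coarse test: a point can pass it and still lie outside the national border, which is why the spatial join against municipality polygons remains the stricter check.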

Main processing pipeline 

The function responsible for the main pipeline orchestrates and executes all preceding steps in an integrated manner for one or more station types. Based on the tipos ("1" and/or "2") and uf parameters, it controls how requests are issued: when uf is NULL, a single nationwide request is performed per station type; when a vector of UFs is provided, one request per state is executed, using the full state name with the exact spelling expected by ANA. In each iteration, the raw data returned by HidroInventario are accumulated and combined into a single table per station type.

The workflow then sequentially applies the post-processing routines: data cleaning and type casting, calculation of consolidated start and end dates of operation, deduplication by station code, and geographic validation. After that, an optional filter based on DataInicioOperacao (controlled by limite_inicio) is applied, and summaries of the number of stations per UF are generated, both before and after the temporal filter. When the exportar parameter is enabled and dir_export is provided, the filtered data are written to .parquet files (with gzip compression) and .xlsx files, with filenames differentiated by station type. Finally, the function returns a list in which each station type contains the fully processed dataset, the filtered version, and the corresponding summaries by UF, ready for use in environmental, hydrological, or epidemiological analyses.
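The temporal filter and the per-UF summaries can be sketched in pandas as follows. The function name and the summary column names `n_before` and `n_after` are hypothetical; `limite_inicio` follows the parameter described above, and `None` plays the role of an absent limit.

```python
import pandas as pd


def filter_and_summarise(df, limite_inicio=None, uf_col="nmEstado"):
    """Filter by DataInicioOperacao and count stations per state before and after."""
    before = df.groupby(uf_col).size().rename("n_before")
    if limite_inicio is None:
        filtered = df
    else:
        # Rows with a missing start date compare as False and are dropped.
        filtered = df[df["DataInicioOperacao"] <= pd.Timestamp(limite_inicio)]
    after = filtered.groupby(uf_col).size().rename("n_after")
    summary = pd.concat([before, after], axis=1).fillna(0).astype(int)
    return filtered, summary


stations = pd.DataFrame({
    "nmEstado": ["BAHIA", "BAHIA", "ACRE"],
    "DataInicioOperacao": pd.to_datetime(["1990-01-01", "2024-06-01", None]),
})
filtered, summary = filter_and_summarise(stations, "2024-01-01")
```

Keeping both counts side by side makes it easy to see how many stations each state loses to the temporal filter.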

Script flowchart for station registration data retrieval 

The four diagrams below provide a structured and progressive view of how the script works. The first diagram presents a macro-level overview of the four processing stages. The subsequent three diagrams detail the steps within each stage: data acquisition (orange and blue blocks), data post-processing and validation (green blocks), and the main pipeline execution with export (purple blocks). The same colour coding is used consistently across all diagrams to identify the category of each step.

Overview 

The overview diagram illustrates the four sequential stages of the pipeline at a high level, serving as a navigation reference for the detail diagrams that follow.

        flowchart TB
  A["Setup & Configuration"]:::configCode --> B["Schema & Registry Mapping"]:::configCode
  B --> C["Request & XML Parsing"]:::reqCode
  C --> D["Post-processing/Validation"]:::validacao
  D --> E["Outputs & Export"]:::exportacao

  classDef configCode fill:#fff3e0,stroke:#ef6c00,color:#000;
  classDef reqCode fill:#e3f2fd,stroke:#1565c0,color:#000;
  classDef validacao fill:#e8f5e9,stroke:#2e7d32,color:#000;
  classDef exportacao fill:#ede7f6,stroke:#5e35b1,color:#000;

Data Acquisition 

This diagram details the orange and blue stages: initial setup and configuration, registry field mapping, and the full HTTP request and XML parsing logic, including all error-handling branches.

        flowchart TB
  subgraph S["Setup & Configuration / Schema & Registry Mapping"]
    A1["Load packages + Define API constants"]:::configCode
    A2["Map registry fields + Define operation date columns"]:::configCode
    A1 --> A2
  end

  C0["Validate station type (1 or 2)"]:::reqCode
  C1["Build GET request (tpEst + nmEstado)"]:::reqCode
  C2{"HTTP 200?"}:::reqCode
  C2_no["Log HTTP error → return empty table"]:::reqCode
  C2_yes["Parse XML + drop namespaces"]:::reqCode
  C3{"XML error node?"}:::reqCode
  C3_yes["Log API error → return empty table"]:::reqCode
  C3_no["Locate Table nodes"]:::reqCode
  C4{"Stations found?"}:::reqCode
  C4_no["Warn: no data for state → return empty table"]:::reqCode
  C4_yes["Extract fields → build raw table"]:::reqCode

  A2 --> C0 --> C1 --> C2
  C2 -- "No"  --> C2_no
  C2 -- "Yes" --> C2_yes --> C3
  C3 -- "Yes" --> C3_yes
  C3 -- "No"  --> C3_no --> C4
  C4 -- "No"  --> C4_no
  C4 -- "Yes" --> C4_yes

  classDef configCode fill:#fff3e0,stroke:#ef6c00,color:#000;
  classDef reqCode fill:#e3f2fd,stroke:#1565c0,color:#000;
  classDef validacao fill:#e8f5e9,stroke:#2e7d32,color:#000;
  classDef exportacao fill:#ede7f6,stroke:#5e35b1,color:#000;

Post-processing and Validation 

This diagram details the green stage: all data cleaning, type casting, date computation, deduplication, and geographic validation steps applied to the raw table returned by the request function before any export or filtering occurs.

        flowchart TB
   D1["Fix coordinates + Convert numeric types"]:::validacao
   D2["Parse date columns: Periodo*; DataIns; DataAlt"]:::validacao
   D3["Standardize text fields + replace empty strings"]:::validacao
   D4["Compute: DataInicioOperacao; DataFimOperacao"]:::validacao
   D5["Deduplicate: keep most recent record per station"]:::validacao
   D6["Validate altitude + flag lat/lon outside Brazil"]:::validacao
   D1 --> D2 --> D3 --> D4 --> D5 --> D6

  classDef validacao fill:#e8f5e9,stroke:#2e7d32,color:#000;

Main Pipeline — Registration Data Acquisition 

This diagram shows the purple stage: the main pipeline orchestration, which loops over station types and states, calls the request and post-processing functions, applies the eligibility date filter, and optionally exports the results to Parquet and Excel files.

        flowchart TB
  H0["Select station types + target states"]:::configCode
  H1{"Query mode"}:::reqCode
  H1a["Iterate: type × state"]:::reqCode
  H1b["Single call per type (all states)"]:::reqCode
  H2["Run post-processing & validation"]:::validacao
  H3["Filter by DataInicioOperacao ≤ limite_inicio"]:::exportacao
  H4["Generate summaries by state"]:::exportacao
  H5{"Export results?"}:::exportacao
  X1["Write .parquet (gzip) + .xlsx by station type"]:::exportacao
  X2["Write summary by state (.xlsx)"]:::exportacao
  H6["Return structured results list"]:::exportacao

  H0 --> H1
  H1 -- "Iterate"      --> H1a --> H2
  H1 -- "Single call"  --> H1b --> H2
  H2 --> H3 --> H4 --> H5
  H5 -- "Yes" --> X1 --> X2 --> H6
  H5 -- "No"  --> H6

  classDef configCode fill:#fff3e0,stroke:#ef6c00,color:#000;
  classDef reqCode fill:#e3f2fd,stroke:#1565c0,color:#000;
  classDef validacao fill:#e8f5e9,stroke:#2e7d32,color:#000;
  classDef exportacao fill:#ede7f6,stroke:#5e35b1,color:#000;

Descriptive statistics of registered stations 

This section presents summary statistics for the two types of hydrometeorological stations registered in the ANA system: streamflow stations (Type 1) and rainfall stations (Type 2). For each type, statistics are presented by responsible agency and by eligibility for each climatological normal period. Stations are eligible for a given normal when their DataInicioOperacao falls on or before the cutoff date corresponding to that period (see Start and end dates of operation).
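The eligibility rule can be illustrated with a short sketch. The cutoff dates below are hypothetical: only the 01/01/2024 cutoff is stated in this documentation, and the other two are assumed by analogy (the first day of each period's final year).

```python
from datetime import date

# Hypothetical cutoffs; only the 2024 value is confirmed by the text above.
CUTOFFS = {
    "1961-1990": date(1990, 1, 1),
    "1991-2020": date(2020, 1, 1),
    "2021-2024": date(2024, 1, 1),
}


def eligibility(start_dates, cutoffs=CUTOFFS):
    """Return {period: (eligible count, percentage)} for a list of start dates."""
    total = len(start_dates)
    result = {}
    for label, cutoff in cutoffs.items():
        n = sum(1 for d in start_dates if d is not None and d <= cutoff)
        pct = round(100.0 * n / total, 1) if total else 0.0
        result[label] = (n, pct)
    return result


table = eligibility([date(1950, 1, 1), date(2000, 6, 1), date(2023, 3, 15)])
```

Because every cutoff is later than the previous one, the eligible count can only grow from the oldest normal to the most recent one, which is the pattern seen in the eligibility tables.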

Streamflow gauge stations (Type 1)

The ANA system registers a total of 16,078 streamflow stations with a DataInicioOperacao on or before 01/01/2024. Table 1a presents the distribution of streamflow stations by responsible agency.

Table 1a — Streamflow stations (Type 1) by responsible agency
Ranking	Responsible agency	Frequency n (%)
1	ANA	3,914 (24.3%)
2	IGAM-MG	887 (5.5%)
3	CETESB-SP	609 (3.8%)
4	INEMA-BA	584 (3.6%)
5	SPÁGUAS-SP	493 (3.1%)
6	DNOS	395 (2.5%)
7	IAT-PR	354 (2.2%)
8	FURNAS	333 (2.1%)
9	FEPAM-RS	248 (1.5%)
10	INEA-RJ	237 (1.5%)
	(736 others)	8,024 (49.9%)

Source: Original research results.

Of the 16,078 streamflow stations registered with a start of operation on or before 01/01/2024, 24.3% are under the responsibility of ANA (National Water and Basic Sanitation Agency), while 5.5% are managed by IGAM-MG (Minas Gerais State Water Management Institute). CETESB-SP (São Paulo State Environmental Agency) and INEMA-BA (Bahia State Institute for the Environment and Water Resources) account for 3.8% and 3.6% of stations registered in the ANA system, respectively. The 10 agencies with the highest representation account for 50.1% of the stations; a further 736 agencies together account for the remaining 49.9%.

Table 1b presents the number of streamflow stations eligible for each climatological normal period.

Table 1b — Streamflow station eligibility by climatological normal
Normal code	Period	Eligible stations	Eligibility (%)
1	1961–1990	6,534	40.6
2	1991–2020	13,249	82.4
3	2021–2024	16,078	100.0

Source: Original research results.

Of the 16,078 streamflow stations registered with a DataInicioOperacao on or before 01/01/2024, the proportion of stations eligible for each climatological normal decreases progressively as the period recedes further into the past: all 16,078 stations (100.0%) are eligible for the 2021–2024 normal, while 13,249 (82.4%) meet the eligibility criterion for the 1991–2020 normal, and 6,534 (40.6%) for the 1961–1990 normal (Table 1b). Eligibility is determined by a cutoff applied to DataInicioOperacao, which is discussed in detail in the sections Start and end dates of operation and Data Scope and Temporal Coverage.

Rainfall gauge stations (Type 2)

The ANA system registers a total of 19,985 rainfall stations with a DataInicioOperacao on or before 01/01/2024. Table 2a presents the distribution of rainfall stations by responsible agency.

Table 2a — Rainfall stations (Type 2) by responsible agency
Ranking	Responsible agency	Frequency n (%)
1	ANA	4,126 (20.6%)
2	CEMADEN	2,533 (12.7%)
3	SPÁGUAS-SP	1,878 (9.4%)
4	SUDENE	1,348 (6.7%)
5	IAT-PR	827 (4.1%)
6	INMET	827 (4.1%)
7	FUNCEME-CE	791 (4.0%)
8	DNOCS	637 (3.2%)
9	EMPARN-RN	398 (2.0%)
10	DNOS	295 (1.5%)
	(706 others)	6,325 (31.6%)

Source: Original research results.

Of the 19,985 rainfall stations registered with a start of operation on or before 01/01/2024 [1], 20.6% are under the responsibility of ANA, while 12.7% are managed by CEMADEN (National Centre for Monitoring and Early Warning of Natural Disasters). INMET (the National Institute of Meteorology) accounts for 4.1% of stations registered in the ANA system. The 10 agencies with the highest representation account for 68.4% of the stations; a further 706 agencies together account for the remaining 31.6%.

Table 2b presents the number of rainfall stations eligible for each climatological normal period.

Table 2b — Rainfall station eligibility by climatological normal
Normal code	Period	Eligible stations	Eligibility (%)
1	1961–1990	11,627	58.2
2	1991–2020	19,582	98.0
3	2021–2024	19,985	100.0

Source: Original research results.

Of the 19,985 rainfall stations registered with a DataInicioOperacao on or before 01/01/2024, 19,985 (100.0%) are eligible for the 2021–2024 normal, 19,582 (98.0%) for the 1991–2020 normal, and 11,627 (58.2%) for the 1961–1990 normal (Table 2b). Eligibility is determined by a cutoff applied to DataInicioOperacao, which is discussed in detail in the sections Start and end dates of operation and Data Scope and Temporal Coverage.

Spatial distribution of ANA’s gauge stations 

This section presents interactive maps showing the spatial distribution of the stations registered in the ANA system for each station type. Maps are generated from the georeferenced inventory and rendered using the Folium library. Use the controls embedded in each map to zoom, pan, and inspect individual stations.

Streamflow gauge stations (Type 1)

The map below shows the spatial distribution of streamflow stations (Type 1) registered in the ANA system.

Although stations are found throughout Brazil, they are most densely concentrated along the east coast, where the largest cities are located.

Rainfall gauge stations (Type 2)

The map below shows the spatial distribution of rainfall stations (Type 2) registered in the ANA system. The figure that follows the map illustrates the distribution by responsible agency.

Although stations are found throughout Brazil, they are most densely concentrated along the east coast, where the largest cities are located.

Station Georeferencing: ANA–IBGE 2025 Spatial Integration 

Overview 

The station inventory provided by ANA includes geographic coordinates (latitude and longitude) and a field identifying the associated municipality (nmMunicipio). However, the municipality codes used in the ANA system are not compatible with the official IBGE coding scheme (CD_MUN), which prevents direct table joins between the two systems. Additionally, the nmMunicipio field presents data quality issues, including incomplete names and spelling inconsistencies, which further limits its use for administrative linkage.

To overcome these constraints, a coordinate-based spatial georeferencing pipeline was developed in Python. Each station is assigned a municipality — and consequently a standardised IBGE municipality code, name, and state identifier — based solely on its geographic coordinates and the official 2025 IBGE municipality boundary shapefile.

Note

ANA’s HidroWeb download portal (https://www.snirh.gov.br/hidroweb/download) provides the Inventory file, which contains a database in .mdb format. In this database, the Municipality table establishes the correspondence between the municipality codes used by ANA and the respective official codes and names from IBGE (Brazilian Institute of Geography and Statistics). This table is used to resolve incompatibilities in station records whose municipality code is valid only in ANA’s systems. The IBGE municipal code obtained through this correspondence is preserved in the georeferenced dataset of the station registry in the cod_ibge_ana field, allowing traceability, referencing, and auditing.

Data Sources and Spatial Reference System 

The georeferencing pipeline uses two primary inputs:

ANA station inventories — 19,985 rainfall stations and 16,078 streamflow stations retrieved from the HidroInventário [1], each carrying Latitude and Longitude fields registered in SIRGAS 2000.
IBGE 2025 municipality shapefile — 5,573 municipal boundary polygons, also in SIRGAS 2000 geographic coordinates (EPSG:4674). Key fields used: CD_MUN (municipality code), NM_MUN (municipality name), SIGLA_UF (state abbreviation), NM_REGIAO (region name).

The full pipeline operates in EPSG:4674 (SIRGAS 2000 geographic). The only exception is the distance calculation step for unmatched stations, which temporarily reprojects the affected subset to EPSG:5880 (SIRGAS 2000 / Brazil Polyconic) to obtain distances in metres.

Georeferencing Methodology 

The pipeline follows a two-stage spatial association strategy implemented with GeoPandas.

Primary Association — Spatial Containment (`within`)

In the first stage, a spatial join with predicate='within' is performed between the station point geometries and the municipality polygon geometries. Each station is assigned to the municipality whose polygon contains its coordinates. This is the most accurate form of association, as it requires the station to be strictly inside the municipality boundary.

# Left join keeps every station; unmatched rows receive missing municipality fields
stations_gdf = gpd.sjoin(stations_gdf, mun_gdf, predicate='within', how='left')

Stations that are not contained within any polygon — typically due to coordinate imprecision, registration errors, or points falling exactly on a boundary — are left unmatched after this step and forwarded to the secondary association stage.

Secondary Association — Proximity (`nearest`)

In the second stage, unmatched stations are assigned to the geographically nearest municipality using gpd.sjoin_nearest(). Before this operation, both the unmatched stations and the municipality geometries are temporarily reprojected to EPSG:5880 to enable distance measurement in metres. The distance from each station to the nearest municipality boundary is recorded in the distancia_municipio_m column.

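CRS_METRIC = "EPSG:5880"  # SIRGAS 2000 / Brazil Polyconic, units in metres
# Reproject both layers so that distances are measured in metres
unmatched_proj = unmatched.to_crs(CRS_METRIC)
mun_proj       = mun_gdf.to_crs(CRS_METRIC)
result         = gpd.sjoin_nearest(unmatched_proj, mun_proj, distance_col='distancia_municipio_m')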

Both geometries are then reprojected back to EPSG:4674 for consistency with the rest of the dataset.
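The two stages can be combined into one runnable sketch on toy data. The function name `assign_municipality`, the two square "municipalities", and the station codes are invented for illustration; only the CRS codes, the join predicates, and the output column names come from this documentation.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point, box

CRS_GEO = "EPSG:4674"     # SIRGAS 2000 geographic
CRS_METRIC = "EPSG:5880"  # SIRGAS 2000 / Brazil Polyconic (metres)


def assign_municipality(stations, municipalities):
    """Stage 1: containment join. Stage 2: nearest polygon for the leftovers."""
    joined = gpd.sjoin(stations, municipalities, predicate="within", how="left")
    hit = joined["index_right"].notna()
    matched = joined[hit].drop(columns="index_right").assign(
        metodo_associacao="within", distancia_municipio_m=float("nan"))
    unmatched = stations.loc[joined.index[~hit]]  # original columns only
    if unmatched.empty:
        return matched
    near = gpd.sjoin_nearest(
        unmatched.to_crs(CRS_METRIC), municipalities.to_crs(CRS_METRIC),
        how="left", distance_col="distancia_municipio_m")
    near = near.to_crs(CRS_GEO).drop(columns="index_right").assign(
        metodo_associacao="nearest")
    return pd.concat([matched, near])


# Toy inputs: two adjacent squares and two stations, one of them outside both.
mun = gpd.GeoDataFrame(
    {"CD_MUN": ["A", "B"]},
    geometry=[box(-40, -13, -39, -12), box(-39, -13, -38, -12)], crs=CRS_GEO)
sta = gpd.GeoDataFrame(
    {"Codigo": [1, 2]},
    geometry=[Point(-39.5, -12.5), Point(-37.9, -12.5)], crs=CRS_GEO)

res = assign_municipality(sta, mun).sort_values("Codigo")
```

Selecting the unmatched stations from the original table, instead of from the join result, avoids carrying the empty municipality columns into the second join.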

Output Attributes 

Inventory Registration Fields

The ANA HidroInventário endpoint [1] returns, for each registered station, a fixed set of registration fields covering operational metadata. These fields describe the station’s location (state, municipality, river, and hydrographic basin), its responsible and operating agencies, the types of instruments installed (rain gauges, water level recorders, telemetric sensors, among others), the start and end dates of each measurement component, the current operational status, and a set of administrative control attributes related to record insertion and updating. The pipeline maintains a single, consolidated list of all field names, ensuring that every station — regardless of type or state — is stored using the same tabular structure, and that queries targeting different states or types return mutually compatible datasets.

Despite the relevance of this metadata for quality control, temporal filtering, and downstream analyses, ANA does not publish an official, comprehensive data dictionary for the fields returned by the HidroInventário endpoint [1]. At the time of this writing, no machine-readable or structured documentation covering the full set of inventory fields was identified in any publicly available ANA resource, including the HidroWeb portal, the ServiceANA documentation page, or the documentation of the new HidroWeb API currently under development. This absence of formal documentation introduces uncertainty in the interpretation of certain fields — particularly those related to instrument-specific operational flags, administrative sub-categories, and status codes whose values are not self-explanatory.

As a practical consequence, the description of a subset of fields was inferred from indirect sources: the behaviour of the returned values during processing, cross-referencing with the Estacao table available in the Inventário archive distributed via the HidroWeb download portal [2], and partial references found in third-party technical literature and official reports. Fields for which a reliable description could be established are documented accordingly. Fields for which no description could be confirmed from any institutional source remain flagged as undocumented pending formal clarification from ANA.

Note

A formal enquiry with ANA regarding the description of undocumented inventory fields is in progress. Once confirmed, the field-level documentation will be updated accordingly.

The complete list of registration fields will be published in this documentation once it is obtained from official ANA sources. It will include descriptions for the subset of fields that could be confirmed by institutional sources; fields without confirmed descriptions will be marked as such.

New variables

After both stages, each station record is enriched with the following columns:

Table 3 — Georeferencing output columns
Column	Type	Description
`uf_ana`	string	State abbreviation from ANA.
`cod_mun_ibge`	string	IBGE municipality code (`CD_MUN` from shapefile).
`nome_mun_ibge`	string	Official municipality name (`NM_MUN` from shapefile).
`uf_ibge`	string	State abbreviation from shapefile (`SIGLA_UF`).
`metodo_associacao`	string	`"within"` for containment; `"nearest"` for proximity.
`distancia_municipio_m`	float	Distance in metres to nearest boundary (`nearest` only; `NaN` for `within`).
`uf_divergente`	boolean	Value is `True` if the state inferred spatially (`SIGLA_UF`) differs from the state recorded in `nmEstado` (ANA).
`cod_ibge_ana`	string	IBGE municipality code as provided in the ANA `Municipio` table (see note below); retained for reference and audit purposes only.

Note

Although the municipality code supplied by ANA (cod_ibge_ana) is retained, the municipality identification adopted in all subsequent analyses is the one obtained through the spatial join on the station’s coordinates (fields cod_mun_ibge and nome_mun_ibge). This approach was adopted because it more accurately represents the actual location of the station relative to the most recent IBGE municipal boundaries (2025).
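The way the Table 3 columns relate to one another can be sketched for a single station as follows. This is an illustration under stated assumptions rather than the pipeline's code: the build_output_columns helper and the structure of the match dictionary are hypothetical, the municipality codes and names are placeholders, and uf_ana is assumed to already hold the state abbreviation derived from nmEstado.

```python
import math

def build_output_columns(uf_ana, cod_ibge_ana, match):
    """Assemble the Table 3 columns for one station.

    `match` is a hypothetical dict holding the joined shapefile attributes
    (CD_MUN, NM_MUN, SIGLA_UF), the method used, and the distance in metres.
    """
    is_nearest = match["method"] == "nearest"
    return {
        "uf_ana": uf_ana,
        "cod_mun_ibge": match["CD_MUN"],
        "nome_mun_ibge": match["NM_MUN"],
        "uf_ibge": match["SIGLA_UF"],
        "metodo_associacao": match["method"],
        # The distance is only meaningful for proximity matches.
        "distancia_municipio_m": match["distance_m"] if is_nearest else math.nan,
        # Flag stations whose spatially inferred state differs from the ANA record.
        "uf_divergente": match["SIGLA_UF"] != uf_ana,
        # Kept for audit only; not used to identify the municipality.
        "cod_ibge_ana": cod_ibge_ana,
    }

# Placeholder values, not real IBGE codes.
row = build_output_columns(
    uf_ana="BA",
    cod_ibge_ana="0000000",
    match={"CD_MUN": "0000001", "NM_MUN": "Example", "SIGLA_UF": "SE",
           "method": "nearest", "distance_m": 152.4},
)
```

In this example the station is matched by proximity to a municipality in a different state than the one ANA records, so uf_divergente is set and the record can be singled out for review.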

Georeferencing Results 

The table below summarises the georeferencing results for each station type.

Table 4 — Georeferencing result summary by station type
Station type	Total stations	`within`, n (%)	`nearest`, n (%)
Streamflow station (Type 1)	16,078	16,037 (99.7%)	41 (0.3%)
Rain gauge station (Type 2)	19,985	19,904 (99.6%)	81 (0.4%)

The georeferencing pipeline assigned all 36,063 stations to a Brazilian municipality. Spatial containment (within) resolved the vast majority of associations: 99.7% of streamflow stations (16,037 of 16,078) and 99.6% of rain gauge stations (19,904 of 19,985) fell inside a single municipality polygon. The remaining stations, 41 streamflow (0.3%) and 81 rain gauge (0.4%), were assigned by proximity (nearest), indicating that their registered coordinates fall outside all municipal boundaries, most likely due to positional imprecision in the ANA inventory.
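The percentages in Table 4 follow directly from the counts. The short check below recomputes them from the figures reported above; the dictionary layout and variable names are illustrative choices, not part of the pipeline.

```python
# Counts per association method, taken from Table 4.
counts = {
    "Streamflow (Type 1)": {"within": 16037, "nearest": 41},
    "Rain gauge (Type 2)": {"within": 19904, "nearest": 81},
}

summary = {}
for station_type, by_method in counts.items():
    total = sum(by_method.values())
    summary[station_type] = {
        "total": total,
        # Share of each method, as a percentage rounded to one decimal place.
        **{method: round(100 * n / total, 1) for method, n in by_method.items()},
    }

grand_total = sum(row["total"] for row in summary.values())
```

With a georeferenced inventory loaded as a table, the same summary would come from counting the values of metodo_associacao per station type.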

Georeferencing Pipeline Diagram 

The diagram below summarises the full georeferencing workflow.

        flowchart TB
  A["Load ANA station inventories (19,985 rainfall + 16,078 streamflow)"]:::configCode
  B["Load IBGE 2025 municipality shapefile (5,573 features, EPSG:4674)"]:::configCode
  D["Build station GeoDataFrame (Latitude / Longitude → geometry, EPSG:4674)"]:::validacao

  E["Stage 1: spatial join within sjoin(predicate='within')"]:::reqCode
  F{"All stations matched?"}:::reqCode
  G["Stage 2: proximity join sjoin_nearest() in EPSG:5880"]:::reqCode

  H["Merge within + nearest results"]:::validacao
  I["Add metodo_associacao / distancia_municipio_m / uf_divergente"]:::validacao

  K["Export georeferenced inventory (.parquet + .xlsx)"]:::exportacao

  A --> D
  B --> E
  D --> E
  E --> F
  F -- "Yes" --> H
  F -- "No (unmatched)" --> G --> H
  H --> I --> K

  classDef configCode fill:#fff3e0,stroke:#ef6c00,color:#000;
  classDef reqCode fill:#e3f2fd,stroke:#1565c0,color:#000;
  classDef validacao fill:#e8f5e9,stroke:#2e7d32,color:#000;
  classDef exportacao fill:#ede7f6,stroke:#5e35b1,color:#000;

Nearest-matched Stations Map 

Stations that could not be assigned to a municipality via spatial containment (within) were matched by proximity to the nearest municipality boundary (nearest). These stations typically lie outside all municipality polygons due to coordinate imprecision or positions near state boundaries. The map below shows their spatial distribution.

References

Contributors

Contributor	Institution	Location
Marcos Eustorgio Filho	Center for Data and Knowledge Integration for Health (CIDACS), Instituto Gonçalo Moniz, Fundação Oswaldo Cruz	Salvador, Brazil
Danielson Neves	Center for Data and Knowledge Integration for Health (CIDACS), Instituto Gonçalo Moniz, Fundação Oswaldo Cruz	Salvador, Brazil
