# Tableau DQ Metrics

This readme describes the process for publishing the DQ metrics data from AWS S3 to Tableau Server.

There are two methods for publishing the data to Tableau Server:

1. [Running the Python scripts locally](#python-scripts)
2. Using the [GitHub Actions workflow](#github-actions-workflow) to run the scripts on a schedule

# Overview

This section provides an overview of the process for publishing DQ metrics data to Tableau Server.

The Python scripts are used to generate a Tableau `.hyper` datasource file from the DQ metric JSON files stored in S3. This `.hyper` file is then published to Tableau Server by overwriting an existing datasource. Once this datasource is updated, a ping is sent to a Tableau view to trigger a cache refresh, ensuring that the latest data is available for visualisation.

The following sections provide more details on the Python scripts and the GitHub Actions workflow that automates this process.

# Python Scripts

There are two Python scripts involved:

- `generate_tableau_data.py`
  Reads DQ metric JSON files from the S3 bucket `eligibility-signposting-api-dev-dq-metrics`, filters to approximately the last 3 months, and writes a Tableau Hyper extract called `converted.hyper`.

- `tableau_refresh.py`
  Publishes `./converted.hyper` to Tableau Server by overwriting an existing datasource, then pings a Tableau view to trigger a cache refresh.

This work supports the EliD DQ metrics Tableau MVP, where Tableau is being used to visualise DQ metrics for monitoring and comparison against expected thresholds.

---

## What the scripts do

### 1. Generate Hyper extract

`generate_tableau_data.py`:

- connects to S3
- scans daily `processing_date=YYYYMMDD/` prefixes for the last 90 days
- reads `.json` files from the bucket
- parses JSON or JSONL content into a pandas DataFrame
- creates a Tableau Hyper file named `converted.hyper`

The S3 source bucket is currently hard-coded as:

```python
S3_BUCKET = "eligibility-signposting-api-dev-dq-metrics"
```

and the output file is:

```python
LOCAL_HYPER_PATH = "converted.hyper"
```

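For illustration, a minimal sketch of what the scan-and-convert step can look like is shown below. The helper names, the text-only column typing, and the JSON/JSONL fallback logic are assumptions for the sketch, not the script's actual implementation:

```python
import json
from datetime import datetime, timedelta, timezone

import boto3
import pandas as pd
from tableauhyperapi import (Connection, CreateMode, HyperProcess, Inserter,
                             SqlType, TableDefinition, TableName, Telemetry)

S3_BUCKET = "eligibility-signposting-api-dev-dq-metrics"
LOCAL_HYPER_PATH = "converted.hyper"


def read_recent_metrics(days_back: int = 90) -> pd.DataFrame:
    """Collect JSON/JSONL metric records from daily processing_date prefixes."""
    s3 = boto3.client("s3")
    rows = []
    today = datetime.now(timezone.utc).date()
    for offset in range(days_back):
        day = today - timedelta(days=offset)
        prefix = f"processing_date={day:%Y%m%d}/"
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=S3_BUCKET, Prefix=prefix):
            for obj in page.get("Contents", []):
                if not obj["Key"].endswith(".json"):
                    continue
                body = s3.get_object(Bucket=S3_BUCKET, Key=obj["Key"])["Body"].read().decode("utf-8")
                try:
                    data = json.loads(body)  # plain JSON: a single object or a list
                    rows.extend(data if isinstance(data, list) else [data])
                except json.JSONDecodeError:
                    # fall back to JSONL: one JSON object per line
                    rows.extend(json.loads(line) for line in body.splitlines() if line.strip())
    return pd.DataFrame(rows)


def write_hyper(df: pd.DataFrame, path: str = LOCAL_HYPER_PATH) -> None:
    """Write the DataFrame to a Hyper extract, treating every column as text."""
    table = TableDefinition(
        TableName("Extract", "Extract"),
        [TableDefinition.Column(str(col), SqlType.text()) for col in df.columns],
    )
    with HyperProcess(telemetry=Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hyper:
        with Connection(hyper.endpoint, path, CreateMode.CREATE_AND_REPLACE) as conn:
            conn.catalog.create_schema(table.table_name.schema_name)
            conn.catalog.create_table(table)
            with Inserter(conn, table) as inserter:
                inserter.add_rows(df.astype(str).itertuples(index=False, name=None))
                inserter.execute()
```

Treating every column as `SqlType.text()` keeps the sketch simple; a real extract would normally map pandas dtypes to matching Hyper types.
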
### 2. Publish to Tableau

`tableau_refresh.py`:

- checks that `./converted.hyper` exists
- validates the file type
- reads Tableau credentials and settings from environment variables
- signs in using a Tableau Personal Access Token (PAT)
- overwrites the configured datasource
- pings the Tableau view `EligibilityData-DQMetrics/DataQualityMetrics?:refresh=y` to trigger a refresh

NOTE: PAT credentials must be set as GitHub secrets for the workflow, and as environment variables for local testing.

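As a rough illustration, the sign-in and overwrite flow with `tableauserverclient` can look like the sketch below, assuming the environment variables described later in this readme. The cache refresh ping is omitted because its request handling is specific to the script:

```python
import os

import tableauserverclient as TSC

server = TSC.Server(os.environ["TABLEAU_SERVER_URL"], use_server_version=True)
auth = TSC.PersonalAccessTokenAuth(
    os.environ["TABLEAU_TOKEN_NAME"],
    os.environ["TABLEAU_TOKEN_VALUE"],
    site_id=os.environ.get("TABLEAU_SITE_ID", "NHSD_DEV"),
)

with server.auth.sign_in(auth):
    # Look up the existing datasource so the overwrite lands in the same project.
    existing = server.datasources.get_by_id(os.environ["TABLEAU_DATASOURCE_ID"])
    item = TSC.DatasourceItem(project_id=existing.project_id, name=existing.name)
    server.datasources.publish(item, "./converted.hyper", mode=TSC.Server.PublishMode.Overwrite)
```

Looking up the existing datasource first means the overwrite is published into the same project under the same name.
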
---
## Repository structure

The GitHub Actions workflow expects the scripts to exist at:

```text
scripts/tableau/generate_tableau_data.py
scripts/tableau/tableau_refresh.py
```

because it runs:

```bash
python scripts/tableau/generate_tableau_data.py
python scripts/tableau/tableau_refresh.py
```

---

## Running locally

### Prerequisites

You will need:

- Python 3.13 (recommended, to match the workflow setup)
- Access to AWS with permission to read from the S3 bucket `eligibility-signposting-api-dev-dq-metrics`
- Tableau Personal Access Token credentials
- The required Python packages installed:
  - `boto3`
  - `pandas`
  - `tableauserverclient`
  - `tableauhyperapi`
  - `requests`

### Install dependencies

If installing dependencies locally:

```bash
pip install boto3 pandas tableauserverclient tableauhyperapi requests
```

### Required environment variables

Before publishing to Tableau, set the following environment variables:

```bash
export TABLEAU_TOKEN_NAME="your-token-name"
export TABLEAU_TOKEN_VALUE="your-token-value"
export TABLEAU_SERVER_URL="https://your-tableau-server"
export TABLEAU_DATASOURCE_ID="your-datasource-id"
export TABLEAU_SITE_ID="NHSD_DEV"
```

`TABLEAU_SERVER_URL` is the base URL of the Tableau Server instance, for example `https://tableau.nhsd.com`.

`TABLEAU_DATASOURCE_ID` is the LUID of the datasource to overwrite, which can be found in the Tableau Server URL when viewing the datasource.

`TABLEAU_SITE_ID` defaults to `NHSD_DEV` if not set.

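For reference, a fail-fast check like the sketch below is one way such variables might be validated before sign-in; it is illustrative rather than the script's actual code:

```python
import os

# Variables with no sensible default must be present before publishing.
REQUIRED = [
    "TABLEAU_TOKEN_NAME",
    "TABLEAU_TOKEN_VALUE",
    "TABLEAU_SERVER_URL",
    "TABLEAU_DATASOURCE_ID",
]
missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing required environment variables: {', '.join(missing)}")

site_id = os.environ.get("TABLEAU_SITE_ID", "NHSD_DEV")  # default noted above
```
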
### AWS credentials

You also need AWS credentials available locally so `boto3` can read from S3. You may also need to set the AWS region if it is not configured globally:

```bash
export AWS_REGION=eu-west-2
```

The workflow uses `eu-west-2`.

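If the region is not set globally, passing an explicit fallback when creating the client is one illustrative option (not necessarily what the script does):

```python
import os

import boto3

# Fall back to eu-west-2, the region used by the workflow.
s3 = boto3.client("s3", region_name=os.environ.get("AWS_REGION", "eu-west-2"))
```
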
### Run the scripts

Step 1: Generate the Hyper file

```bash
python scripts/tableau/generate_tableau_data.py
```

This should create:

```text
converted.hyper
```

Step 2: Publish to Tableau

```bash
python scripts/tableau/tableau_refresh.py
```

If you want the publish step to continue even when the cache refresh ping fails:

```bash
python scripts/tableau/tableau_refresh.py --ignore-refresh-failure
```

The optional `--ignore-refresh-failure` flag prevents the script from exiting with an error if the Tableau refresh ping fails.

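A flag like this is typically wired up with `argparse`; the sketch below is illustrative, with `refresh_ok` standing in for the outcome of the real refresh ping:

```python
import argparse
import sys

parser = argparse.ArgumentParser()
parser.add_argument(
    "--ignore-refresh-failure",
    action="store_true",
    help="do not exit non-zero if the Tableau cache refresh ping fails",
)
args = parser.parse_args()

refresh_ok = False  # placeholder: outcome of the cache refresh ping
if not refresh_ok and not args.ignore_refresh_failure:
    sys.exit("Tableau cache refresh ping failed")
```
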
---

## Expected local flow

1. Read recent DQ metric JSON data from S3
2. Build `converted.hyper`
3. Sign in to Tableau using PAT credentials
4. Overwrite the target datasource
5. Trigger a cache refresh for the relevant workbook view

---

# GitHub Actions workflow

The GitHub Actions workflow is named:

```yaml
Daily Tableau Data Update
```

It supports:

- scheduled execution every day at `10:00 AM UTC`
- manual triggering using `workflow_dispatch` for testing

## Workflow triggers

```yaml
on:
  schedule:
    - cron: '0 10 * * *'
  workflow_dispatch:
```

## Workflow jobs

The workflow has two jobs:

### 1. `metadata`

This job:

- checks out the repo
- reads versions from `.tool-versions`
- sets CI metadata such as build timestamp and version string

### 2. `publish`

This job:

- sets up Python 3.13
- checks out the repository
- installs the required Python packages
- assumes the AWS deployment role using GitHub OIDC
- runs the S3-to-Hyper script
- publishes the datasource to Tableau

---

## GitHub Actions secrets and variables

The workflow requires the following GitHub environment configuration.

### Secrets

- `AWS_ACCOUNT_ID`
- `TABLEAU_TOKEN_NAME`
- `TABLEAU_TOKEN_VALUE`
- `TABLEAU_DATASOURCE_ID`

### Variables

- `TABLEAU_SITE_ID`
- `TABLEAU_SERVER_URL`

### GitHub environment

The workflow runs under the `dev` environment.

---

## Example GitHub Actions execution flow

```text
Schedule or manual trigger
  -> metadata job
  -> publish job
       -> setup Python
       -> install dependencies
       -> assume AWS role
       -> generate converted.hyper from S3 JSON files
       -> publish converted.hyper to Tableau datasource
       -> trigger Tableau cache refresh
```

---

## Troubleshooting

### `Datasource file not found: ./converted.hyper`

The publish script expects `converted.hyper` to exist in the current working directory. Run the data generation script first.

### `Missing required environment variables`

Ensure the required Tableau environment variables are set before running `tableau_refresh.py`.

### No data found

If no JSON files are found for the date range, the generation script will print:

```text
No data found for the selected date range.
```

and no Hyper file will be created.

### Cache refresh fails

By default, a Tableau cache refresh failure causes the script to exit with a non-zero status. Use:

```bash
python scripts/tableau/tableau_refresh.py --ignore-refresh-failure
```

if you want the publish to succeed even when the refresh ping fails.
