Skip to content

feat(data): centralize sample-dataset metadata in a manifest (#774)#886

Merged
henrydingliu merged 3 commits into
casact:mainfrom
SaguaroDev:774-centralize-sample-metadata
May 31, 2026
Merged

feat(data): centralize sample-dataset metadata in a manifest (#774)#886
henrydingliu merged 3 commits into
casact:mainfrom
SaguaroDev:774-centralize-sample-metadata

Conversation

@SaguaroDev
Copy link
Copy Markdown
Contributor

@SaguaroDev SaguaroDev commented May 31, 2026

Closes #774.

Sample-dataset metadata was duplicated across four places: the load_sample if/elif chain, the tests, docs/library/sample_data.md, and the per-file include list in MANIFEST.in. They had already drifted: sample_data.md listed 23 of 46 datasets (with names like frieldand_uspp_auto_steady_state and friedland_us_industy_auto_case), and MANIFEST.in listed 22, so the sdist shipped incomplete.

This adds chainladder/utils/data/_manifest.py as the single source of truth (a plain Python dict, no new dependency per the issue discussion) and keys everything off it.

load_sample now looks up its Triangle config from the manifest instead of the if/elif chain. This is behavior-preserving: I captured the resolved origin/development/index/columns/cumulative for all 46 bundled samples before and after the refactor and they are byte-identical.

cl.list_samples() is the new utility @henrydingliu asked for. Per the design confirmed on the issue, it returns a DataFrame of name, index, columns, cumulative, and (when include_grain=True, the default) origin/development grain plus period counts. include_grain=False skips loading the data for a fast metadata-only listing. It doubles as the source for the docs table.

MANIFEST.in collapses the 22 hand-listed include lines to one recursive-include chainladder/utils/data *.csv. I built the sdist both ways and confirmed it now contains all 46 CSVs (was 22). Wheels were already complete via pyproject.toml's package-data wildcard; this fixes the sdist.

Tests test_load_sample now iterates the manifest rather than globbing the data directory, so adding a sample is a one-entry change and stray non-CSV files (__init__.py, _manifest.py) can't be mistaken for datasets. Added a both-ways sync assertion (manifest set == CSVs on disk) and a test_list_samples test.

docs/library/sample_data.md is regenerated from cl.list_samples() via scripts/regen_sample_data_docs.py. The table is now complete and accurate; rerun the script after adding a dataset.

Full test suite: 717 passed, 12 xfailed (pre-existing), 0 failures.


Note

Low Risk
Refactor of sample-loading and packaging metadata with tests guarding manifest/CSV parity and sdist contents; intended behavior parity for load_sample with no auth or production runtime path changes.

Overview
Introduces chainladder/utils/data/_manifest.py as the single registry for all bundled sample CSV metadata (origin, development, index, columns, cumulative). load_sample now validates keys and builds Triangle instances from that dict instead of a long per-dataset if/elif chain (intended to be behavior-preserving).

Adds cl.list_samples() (exported from chainladder.utils) to return a manifest-driven catalog, optionally loading each sample for grain/period counts when include_grain=True.

MANIFEST.in switches from ~22 explicit CSV includes to recursive-include chainladder/utils/data *.csv, fixing incomplete sdists; tests assert manifest ↔ on-disk CSV sync, cover list_samples, and optionally build an sdist to verify all CSVs ship.

Docs replace the static sample_data.md table with sample_data.ipynb (live cl.list_samples()) and document list_samples in the API reference.

Reviewed by Cursor Bugbot for commit 724b455. Bugbot is set up for automated code reviews on this repo. Configure here.

)

Sample-dataset metadata was duplicated across four places: the long
if/elif chain in load_sample, the tests, docs/library/sample_data.md,
and the per-file include list in MANIFEST.in. Adding a sample meant
editing all four by hand, and they had already drifted (sample_data.md
listed 23 of 46 datasets with several typo-d names; MANIFEST.in listed
22, so the sdist shipped incomplete).

Introduce chainladder/utils/data/_manifest.py as the single source of
truth (a plain Python dict, no new dependency), and key everything off
it:

- load_sample now looks up its Triangle config from the manifest instead
  of the if/elif chain. Verified behavior-preserving: the resolved
  origin/development/index/columns/cumulative for all 46 bundled samples
  is byte-identical before and after.
- New public cl.list_samples() returns a DataFrame of name, index,
  columns, cumulative, and (optionally) grain + period counts. Doubles
  as the source for the docs table.
- test_load_sample iterates the manifest rather than globbing the data
  directory, so adding a sample is a one-entry change and stray non-CSV
  files cannot be mistaken for datasets. Added a both-ways sync assertion
  (manifest == CSVs on disk) and a test_list_samples test.
- MANIFEST.in collapses the 22 hand-listed includes to one
  recursive-include chainladder/utils/data *.csv. Verified the built
  sdist now contains all 46 CSVs (was 22).
- docs/library/sample_data.md is regenerated from cl.list_samples() via
  scripts/regen_sample_data_docs.py; the table is now complete and
  accurate.

Closes casact#774.
@codecov
Copy link
Copy Markdown

codecov Bot commented May 31, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.90%. Comparing base (449b5c1) to head (724b455).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #886      +/-   ##
==========================================
- Coverage   87.04%   86.90%   -0.15%     
==========================================
  Files          86       87       +1     
  Lines        4986     4932      -54     
  Branches      646      624      -22     
==========================================
- Hits         4340     4286      -54     
  Misses        456      456              
  Partials      190      190              
Flag Coverage Δ
unittests 86.90% <100.00%> (-0.15%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@henrydingliu
Copy link
Copy Markdown
Collaborator

@SaguaroDev thanks for all this! will the new utility function list_sample get added to the api reference section of the doc automatically? or do we need to manually add it?

…asact#774)

Addresses review feedback on casact#886 from @henrydingliu:

- Add list_samples to docs/library/api.md autosummary and a matching
  generated stub so the new utility appears in the API reference. The
  reference is hand-maintained, not auto-discovered, so this is a manual
  add. Also dropped a pre-existing duplicate load_sample entry in the
  same autosummary block.
- Add test_sdist_ships_all_samples: builds a source distribution and
  asserts every sample CSV is present, guarding against MANIFEST.in
  drifting out of sync again. Self-skips when the build package or a
  source checkout is unavailable, so it stays out of the fast suite as
  Henry suggested.
@SaguaroDev
Copy link
Copy Markdown
Contributor Author

It's manual, not automatic. The API reference (docs/library/api.md) hand-lists each function in an autosummary block, so I added list_samples there plus a matching generated stub, and dropped a duplicate load_sample entry that was already in that block.

On the sdist: added test_sdist_ships_all_samples, which builds a source distribution and asserts every sample CSV is present. It self-skips when the build package or a source checkout isn't available, so it proves completeness in CI without weighing down the fast suite.

Pushed 577fc84.

@@ -0,0 +1,6 @@
chainladder.list\_samples
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we've been deleting these stubs in #879. can you confirm if creating list_samples_rst is still necessary?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped it. The committed stub is redundant anyway since autosummary regenerates it at build, and with #879 removing these on experimental there's no reason to add a new one here. The list_samples entry stays in the api.md autosummary block, which is what actually wires it into the API reference.

Comment thread docs/library/sample_data.md Outdated
@@ -4,54 +4,55 @@ Below is the list of all datasets that come included with the `chainladder` pack

You can load any dataset with `cl.load_sample(...)` such as `cl.load_sample("abc")`.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, now that we have list_samples, would it be cleaner to just change this markdown into a notebook? i.e. just call list_samples in the notebook

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 724b455. sample_data.md is now sample_data.ipynb with a single cl.list_samples() cell, so the table renders live from the manifest and can't drift. I kept the # Sample Dataset H1 so the existing sample_data.html#sample-dataset anchor (linked from the tutorials) still resolves, updated the _toc.yml entry, and removed scripts/regen_sample_data_docs.py since there's nothing static left to regenerate.

…ct#774)

Per Henry's review on casact#886: replace the static sample_data.md table with
a notebook (sample_data.ipynb) that renders the table live via
cl.list_samples(), so the docs can never drift from the manifest. This
also retires the now-pointless scripts/regen_sample_data_docs.py and the
committed list_samples.rst autosummary stub (autosummary regenerates it
at build, and the experimental branch is removing these stubs anyway;
the api.md autosummary entry for list_samples stays).

The notebook keeps the '# Sample Dataset' H1 so the existing
sample_data.html#sample-dataset anchor used by the tutorials still
resolves. _toc.yml, the manifest module docstring, and the
utility_functions comment are updated to point at the notebook.
henrydingliu pushed a commit to henrydingliu/chainladder-python that referenced this pull request May 31, 2026
…asact#774)

Addresses review feedback on casact#886 from @henrydingliu:

- Add list_samples to docs/library/api.md autosummary and a matching
  generated stub so the new utility appears in the API reference. The
  reference is hand-maintained, not auto-discovered, so this is a manual
  add. Also dropped a pre-existing duplicate load_sample entry in the
  same autosummary block.
- Add test_sdist_ships_all_samples: builds a source distribution and
  asserts every sample CSV is present, guarding against MANIFEST.in
  drifting out of sync again. Self-skips when the build package or a
  source checkout is unavailable, so it stays out of the fast suite as
  Henry suggested.
henrydingliu pushed a commit to henrydingliu/chainladder-python that referenced this pull request May 31, 2026
…ct#774)

Per Henry's review on casact#886: replace the static sample_data.md table with
a notebook (sample_data.ipynb) that renders the table live via
cl.list_samples(), so the docs can never drift from the manifest. This
also retires the now-pointless scripts/regen_sample_data_docs.py and the
committed list_samples.rst autosummary stub (autosummary regenerates it
at build, and the experimental branch is removing these stubs anyway;
the api.md autosummary entry for list_samples stays).

The notebook keeps the '# Sample Dataset' H1 so the existing
sample_data.html#sample-dataset anchor used by the tutorials still
resolves. _toc.yml, the manifest module docstring, and the
utility_functions comment are updated to point at the notebook.
Copy link
Copy Markdown
Collaborator

@henrydingliu henrydingliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reviewed sample_data.ipynb on fork to validate build. confirmed that warning from list_sample is pre-existing.

@henrydingliu henrydingliu merged commit 8a9a46e into casact:main May 31, 2026
10 checks passed
henrydingliu pushed a commit that referenced this pull request May 31, 2026
…774)

Addresses review feedback on #886 from @henrydingliu:

- Add list_samples to docs/library/api.md autosummary and a matching
  generated stub so the new utility appears in the API reference. The
  reference is hand-maintained, not auto-discovered, so this is a manual
  add. Also dropped a pre-existing duplicate load_sample entry in the
  same autosummary block.
- Add test_sdist_ships_all_samples: builds a source distribution and
  asserts every sample CSV is present, guarding against MANIFEST.in
  drifting out of sync again. Self-skips when the build package or a
  source checkout is unavailable, so it stays out of the fast suite as
  Henry suggested.
@SaguaroDev SaguaroDev deleted the 774-centralize-sample-metadata branch May 31, 2026 20:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Centralize sample-dataset metadata (load_sample / tests / docs / MANIFEST.in)

2 participants