feat(data): centralize sample-dataset metadata in a manifest (#774) by SaguaroDev · Pull Request #886 · casact/chainladder-python

SaguaroDev · 2026-05-31T00:03:50Z

Closes #774.

Sample-dataset metadata was duplicated across four places: the load_sample if/elif chain, the tests, docs/library/sample_data.md, and the per-file include list in MANIFEST.in. They had already drifted: sample_data.md listed 23 of 46 datasets (with names like frieldand_uspp_auto_steady_state and friedland_us_industy_auto_case), and MANIFEST.in listed 22, so the sdist shipped incomplete.

This adds chainladder/utils/data/_manifest.py as the single source of truth (a plain Python dict, no new dependency per the issue discussion) and keys everything off it.

load_sample now looks up its Triangle config from the manifest instead of the if/elif chain. This is behavior-preserving: I captured the resolved origin/development/index/columns/cumulative for all 46 bundled samples before and after the refactor and they are byte-identical.

cl.list_samples() is the new utility @henrydingliu asked for. Per the design confirmed on the issue, it returns a DataFrame of name, index, columns, cumulative, and (when include_grain=True, the default) origin/development grain plus period counts. include_grain=False skips loading the data for a fast metadata-only listing. It doubles as the source for the docs table.

MANIFEST.in collapses the 22 hand-listed include lines to one recursive-include chainladder/utils/data *.csv. I built the sdist both ways and confirmed it now contains all 46 CSVs (was 22). Wheels were already complete via pyproject.toml's package-data wildcard; this fixes the sdist.

Tests test_load_sample now iterates the manifest rather than globbing the data directory, so adding a sample is a one-entry change and stray non-CSV files (__init__.py, _manifest.py) can't be mistaken for datasets. Added a both-ways sync assertion (manifest set == CSVs on disk) and a test_list_samples test.

docs/library/sample_data.md is regenerated from cl.list_samples() via scripts/regen_sample_data_docs.py. The table is now complete and accurate; rerun the script after adding a dataset.

Full test suite: 717 passed, 12 xfailed (pre-existing), 0 failures.

Note

Low Risk
Refactor of sample-loading and packaging metadata with tests guarding manifest/CSV parity and sdist contents; intended behavior parity for load_sample with no auth or production runtime path changes.

Overview
Introduces chainladder/utils/data/_manifest.py as the single registry for all bundled sample CSV metadata (origin, development, index, columns, cumulative). load_sample now validates keys and builds Triangle instances from that dict instead of a long per-dataset if/elif chain (intended to be behavior-preserving).

Adds cl.list_samples() (exported from chainladder.utils) to return a manifest-driven catalog, optionally loading each sample for grain/period counts when include_grain=True.

MANIFEST.in switches from ~22 explicit CSV includes to recursive-include chainladder/utils/data *.csv, fixing incomplete sdists; tests assert manifest ↔ on-disk CSV sync, cover list_samples, and optionally build an sdist to verify all CSVs ship.

Docs replace the static sample_data.md table with sample_data.ipynb (live cl.list_samples()) and document list_samples in the API reference.

^{Reviewed by Cursor Bugbot for commit 724b455. Bugbot is set up for automated code reviews on this repo. Configure here.}

) Sample-dataset metadata was duplicated across four places: the long if/elif chain in load_sample, the tests, docs/library/sample_data.md, and the per-file include list in MANIFEST.in. Adding a sample meant editing all four by hand, and they had already drifted (sample_data.md listed 23 of 46 datasets with several typo-d names; MANIFEST.in listed 22, so the sdist shipped incomplete). Introduce chainladder/utils/data/_manifest.py as the single source of truth (a plain Python dict, no new dependency), and key everything off it: - load_sample now looks up its Triangle config from the manifest instead of the if/elif chain. Verified behavior-preserving: the resolved origin/development/index/columns/cumulative for all 46 bundled samples is byte-identical before and after. - New public cl.list_samples() returns a DataFrame of name, index, columns, cumulative, and (optionally) grain + period counts. Doubles as the source for the docs table. - test_load_sample iterates the manifest rather than globbing the data directory, so adding a sample is a one-entry change and stray non-CSV files cannot be mistaken for datasets. Added a both-ways sync assertion (manifest == CSVs on disk) and a test_list_samples test. - MANIFEST.in collapses the 22 hand-listed includes to one recursive-include chainladder/utils/data *.csv. Verified the built sdist now contains all 46 CSVs (was 22). - docs/library/sample_data.md is regenerated from cl.list_samples() via scripts/regen_sample_data_docs.py; the table is now complete and accurate. Closes casact#774.

codecov · 2026-05-31T00:08:31Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.90%. Comparing base (449b5c1) to head (724b455).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #886      +/-   ##
==========================================
- Coverage   87.04%   86.90%   -0.15%     
==========================================
  Files          86       87       +1     
  Lines        4986     4932      -54     
  Branches      646      624      -22     
==========================================
- Hits         4340     4286      -54     
  Misses        456      456              
  Partials      190      190

Flag	Coverage Δ
unittests	`86.90% <100.00%> (-0.15%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

henrydingliu · 2026-05-31T00:26:26Z

@SaguaroDev thanks for all this! will the new utility function list_sample get added to the api reference section of the doc automatically? or do we need to manually add it?

@henrydingliu

…asact#774) Addresses review feedback on casact#886 from @henrydingliu: - Add list_samples to docs/library/api.md autosummary and a matching generated stub so the new utility appears in the API reference. The reference is hand-maintained, not auto-discovered, so this is a manual add. Also dropped a pre-existing duplicate load_sample entry in the same autosummary block. - Add test_sdist_ships_all_samples: builds a source distribution and asserts every sample CSV is present, guarding against MANIFEST.in drifting out of sync again. Self-skips when the build package or a source checkout is unavailable, so it stays out of the fast suite as Henry suggested.

SaguaroDev · 2026-05-31T03:01:00Z

It's manual, not automatic. The API reference (docs/library/api.md) hand-lists each function in an autosummary block, so I added list_samples there plus a matching generated stub, and dropped a duplicate load_sample entry that was already in that block.

On the sdist: added test_sdist_ships_all_samples, which builds a source distribution and asserts every sample CSV is present. It self-skips when the build package or a source checkout isn't available, so it proves completeness in CI without weighing down the fast suite.

Pushed 577fc84.

henrydingliu · 2026-05-31T04:22:47Z

@@ -0,0 +1,6 @@
+chainladder.list\_samples


we've been deleting these stubs in #879. can you confirm if creating list_samples_rst is still necessary?

Dropped it. The committed stub is redundant anyway since autosummary regenerates it at build, and with #879 removing these on experimental there's no reason to add a new one here. The list_samples entry stays in the api.md autosummary block, which is what actually wires it into the API reference.

henrydingliu · 2026-05-31T04:32:30Z

@@ -4,54 +4,55 @@ Below is the list of all datasets that come included with the `chainladder` pack

 You can load any dataset with `cl.load_sample(...)` such as `cl.load_sample("abc")`.


actually, now that we have list_samples, would it be cleaner to just change this markdown into a notebook? i.e. just call list_samples in the notebook

Done in 724b455. sample_data.md is now sample_data.ipynb with a single cl.list_samples() cell, so the table renders live from the manifest and can't drift. I kept the # Sample Dataset H1 so the existing sample_data.html#sample-dataset anchor (linked from the tutorials) still resolves, updated the _toc.yml entry, and removed scripts/regen_sample_data_docs.py since there's nothing static left to regenerate.

…ct#774) Per Henry's review on casact#886: replace the static sample_data.md table with a notebook (sample_data.ipynb) that renders the table live via cl.list_samples(), so the docs can never drift from the manifest. This also retires the now-pointless scripts/regen_sample_data_docs.py and the committed list_samples.rst autosummary stub (autosummary regenerates it at build, and the experimental branch is removing these stubs anyway; the api.md autosummary entry for list_samples stays). The notebook keeps the '# Sample Dataset' H1 so the existing sample_data.html#sample-dataset anchor used by the tutorials still resolves. _toc.yml, the manifest module docstring, and the utility_functions comment are updated to point at the notebook.

@henrydingliu

…asact#774) Addresses review feedback on casact#886 from @henrydingliu: - Add list_samples to docs/library/api.md autosummary and a matching generated stub so the new utility appears in the API reference. The reference is hand-maintained, not auto-discovered, so this is a manual add. Also dropped a pre-existing duplicate load_sample entry in the same autosummary block. - Add test_sdist_ships_all_samples: builds a source distribution and asserts every sample CSV is present, guarding against MANIFEST.in drifting out of sync again. Self-skips when the build package or a source checkout is unavailable, so it stays out of the fast suite as Henry suggested.

…ct#774) Per Henry's review on casact#886: replace the static sample_data.md table with a notebook (sample_data.ipynb) that renders the table live via cl.list_samples(), so the docs can never drift from the manifest. This also retires the now-pointless scripts/regen_sample_data_docs.py and the committed list_samples.rst autosummary stub (autosummary regenerates it at build, and the experimental branch is removing these stubs anyway; the api.md autosummary entry for list_samples stays). The notebook keeps the '# Sample Dataset' H1 so the existing sample_data.html#sample-dataset anchor used by the tutorials still resolves. _toc.yml, the manifest module docstring, and the utility_functions comment are updated to point at the notebook.

henrydingliu

reviewed sample_data.ipynb on fork to validate build. confirmed that warning from list_sample is pre-existing.

@henrydingliu

…774) Addresses review feedback on #886 from @henrydingliu: - Add list_samples to docs/library/api.md autosummary and a matching generated stub so the new utility appears in the API reference. The reference is hand-maintained, not auto-discovered, so this is a manual add. Also dropped a pre-existing duplicate load_sample entry in the same autosummary block. - Add test_sdist_ships_all_samples: builds a source distribution and asserts every sample CSV is present, guarding against MANIFEST.in drifting out of sync again. Self-skips when the build package or a source checkout is unavailable, so it stays out of the fast suite as Henry suggested.

SaguaroDev requested review from genedan, henrydingliu, jbogaardt and kennethshsu as code owners May 31, 2026 00:03

henrydingliu reviewed May 31, 2026

View reviewed changes

henrydingliu approved these changes May 31, 2026

View reviewed changes

henrydingliu merged commit 8a9a46e into casact:main May 31, 2026
10 checks passed

henrydingliu mentioned this pull request May 31, 2026

#869 friedland auto bi insurer premium #870

Merged

1 task

SaguaroDev deleted the 774-centralize-sample-metadata branch May 31, 2026 20:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(data): centralize sample-dataset metadata in a manifest (#774)#886

feat(data): centralize sample-dataset metadata in a manifest (#774)#886
henrydingliu merged 3 commits into
casact:mainfrom
SaguaroDev:774-centralize-sample-metadata

SaguaroDev commented May 31, 2026 •

edited by cursor Bot

Loading

Uh oh!

codecov Bot commented May 31, 2026 •

edited

Loading

Uh oh!

henrydingliu commented May 31, 2026

Uh oh!

SaguaroDev commented May 31, 2026

Uh oh!

henrydingliu May 31, 2026

Uh oh!

SaguaroDev May 31, 2026

Uh oh!

henrydingliu May 31, 2026

Uh oh!

SaguaroDev May 31, 2026

Uh oh!

henrydingliu left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@@ -4,54 +4,55 @@ Below is the list of all datasets that come included with the `chainladder` pack

		You can load any dataset with `cl.load_sample(...)` such as `cl.load_sample("abc")`.

Conversation

SaguaroDev commented May 31, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

henrydingliu commented May 31, 2026

Uh oh!

SaguaroDev commented May 31, 2026

Uh oh!

henrydingliu May 31, 2026

Choose a reason for hiding this comment

Uh oh!

SaguaroDev May 31, 2026

Choose a reason for hiding this comment

Uh oh!

henrydingliu May 31, 2026

Choose a reason for hiding this comment

Uh oh!

SaguaroDev May 31, 2026

Choose a reason for hiding this comment

Uh oh!

henrydingliu left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

SaguaroDev commented May 31, 2026 •

edited by cursor Bot

Loading

codecov Bot commented May 31, 2026 •

edited

Loading