Swap out patsy for formulae #463

Draft
wants to merge 1 commit into
base: main

Conversation

@ksolarski ksolarski commented Apr 21, 2025

Solving issue #386

Starting with DiD; I will continue with the other methods if you agree with the general design, @drbenvincent.

It seems the key practical difference between formulae and patsy is that formulae lacks a build_design_matrices method, so the user has to provide the formula again.


📚 Documentation preview 📚: https://causalpy--463.org.readthedocs.build/en/463/

@drbenvincent
Collaborator

Cool. Thanks @ksolarski, just a quick reply from my phone...

Don't do this for the synthetic control because I have an in progress PR that will change it. It won't have a formula input.

But can I just get some clarification... does this change the API? Can we get the exact same functionality? If not, let's think again.

Will try to look at the code properly when I can 👍🏻

@drbenvincent
Collaborator

I can't find where I saw it in the patsy docs at this point, but I think one of the things build_design_matrices does is ensure that predictions on new/out-of-sample data are correct. For example, the out-of-sample data might not contain all levels of a categorical predictor. So I think if you do it naively, you can get silent errors.

I'm not 100% sure that this is a problem, and apologies I can't find the relevant part in the docs. But does my concern make sense?
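
To make the concern concrete, here is a minimal, library-free sketch. It is not how patsy or formulae encode categoricals internally; `fit_levels` and `encode` are hypothetical helpers written only for this illustration, and the data are made up. It contrasts re-deriving the category levels from the new data with reusing the levels recorded on the training data.

```python
def fit_levels(values):
    """Record the sorted category levels seen in a dataset (hypothetical helper)."""
    return sorted(set(values))


def encode(values, levels):
    """Treatment-code values against a fixed list of levels.

    The first level is the baseline and gets no column, so the result
    has len(levels) - 1 columns per row.
    """
    unknown = set(values) - set(levels)
    if unknown:
        raise ValueError(f"unseen levels: {sorted(unknown)}")
    return [[1 if v == level else 0 for level in levels[1:]] for v in values]


train = ["a", "b", "c", "a"]
new = ["b", "c"]  # level "a" never appears in the out-of-sample data

# Levels learned once, on the training data: baseline "a", columns for "b" and "c".
train_levels = fit_levels(train)
X_train = encode(train, train_levels)

# Naive: re-derive the levels from the new data.
# The baseline silently becomes "b" and only a "c" column remains.
X_naive = encode(new, fit_levels(new))

# Stateful: reuse the training levels, so the columns keep their meaning.
X_reused = encode(new, train_levels)

print(len(X_train[0]), len(X_naive[0]), len(X_reused[0]))
```

In this toy case the naive matrix has a different number of columns, so a downstream matrix product would fail loudly. If the new data happened to contain the same number of levels but a different set, the shapes would agree and the coefficients would be applied to the wrong columns without any error.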

@ksolarski
Author

You're right, Patsy can preserve the transformation/encoding of variables through its build_design_matrices method. formulae has no equivalent, so it's certainly not straightforward to reproduce the current behaviour with formulae.

However, the Patsy repo suggests migrating to https://github.com/matthewwardrop/formulaic instead, which is capable of "reusing the encoding choices made during conversion of one data-set on other datasets." (see https://matthewwardrop.github.io/formulaic/latest/). There's also a migration guide from Patsy to Formulaic, so the switch should be easy. It also supports many operators: https://matthewwardrop.github.io/formulaic/latest/guides/grammar/

Did you check out this library before? What do you think about using this instead of formulae?

3 participants