Approx median fixes #20
Open
sastoudt wants to merge 13 commits into datadesk:main from sastoudt:approxMedianFixes
13 commits
bea3b1e moe and jam adds, need to work out edge cases (sastoudt)
5b0b35c nan to none and tests for cases that are working (sastoudt)
0669165 stop averaging jam values in moe, tests that reflect this change (sastoudt)
83cb895 add comments for simulation jam choices (sastoudt)
6128fe1 make different jam value warnings, tests for warnings (sastoudt)
02af661 error if only one jam value given and two are needed, plus test (sastoudt)
0a31f3c documentation (sastoudt)
742ad1e add example to readme, spacing tests (sastoudt)
438f8f8 update examples, consistent call of functions (sastoudt)
c910c0c take max of zero and simulated n to avoid negative numbers (sastoudt)
7e211a6 fix none v. nan (sastoudt)
c23809d documentation tweaks (sastoudt)
5cb87b9 fix tests and bug in none handling (sastoudt)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -95,10 +95,10 @@ Expects a list of dictionaries that divide the full range of data values into co | |
| dict(min=150000, max=199999, n=6931136, moe=37236), | ||
| dict(min=200000, max=1000000, n=7465517, moe=42206) | ||
| ] | ||
| >>> approximate_mean(income) | ||
| >>> census_data_aggregator.approximate_mean(income) | ||
| (98045.44530685373, 194.54892406267754) | ||
|
|
||
| Note that this function expects you to submit a lower bound for the smallest bin and an upper bound for the largest bin. This is often not available for ACS datasets like income. We recommend experimenting with different lower and upper bounds to assess its effect on the resulting mean. | ||
| Note that, unlike `approximate_median`, this function expects you to submit a lower bound for the smallest bin and an upper bound for the largest bin. This is because the Census's jam value approach is only used for median calculations. We recommend experimenting with different lower and upper bounds to assess their effect on the resulting mean. | |
|
|
||
| By default the simulation is run 50 times, which can take as long as a minute. The number of simulations can be changed by setting the `simulation` keyword argument. | ||
|
|
||
|
|
@@ -110,7 +110,7 @@ The simulation assumes a uniform distribution of values within each bin. In some | |
|
|
||
| .. code-block:: python | ||
|
|
||
| >>> approximate_mean(income, pareto=True) | ||
| >>> census_data_aggregator.approximate_mean(income, pareto=True) | ||
| (60364.96525340687, 58.60735554621351) | ||
|
|
||
| Also, due to the stochastic nature of the simulation approach, you will need to set a seed before running this function to ensure replicability. | ||
|
|
@@ -119,10 +119,10 @@ Also, due to the stochastic nature of the simulation approach, you will need to | |
|
|
||
| >>> import numpy | ||
| >>> numpy.random.seed(711355) | ||
| >>> approximate_mean(income, pareto=True) | ||
| >>> census_data_aggregator.approximate_mean(income, pareto=True) | ||
| (60364.96525340687, 58.60735554621351) | ||
| >>> numpy.random.seed(711355) | ||
| >>> approximate_mean(income, pareto=True) | ||
| >>> census_data_aggregator.approximate_mean(income, pareto=True) | ||
| (60364.96525340687, 58.60735554621351) | ||
|
|
||
|
|
||
|
|
@@ -131,43 +131,52 @@ Approximating medians | |
|
|
||
| Estimate a median and approximate the margin of error. Follows the U.S. Census Bureau's official guidelines for estimation. Useful for generating medians for measures like household income and age when aggregating census geographies. | ||
|
|
||
| Expects a list of dictionaries that divide the full range of data values into continuous categories. Each dictionary should have three keys: | ||
| Expects a list of dictionaries that divide the full range of data values into continuous categories. The first `min` and the last `max` should be `None` since we typically do not know the boundaries for the top and bottom bins (e.g. income). If these values are actually known (e.g. lower bound for age), the known value can replace `None`. Each dictionary should have three keys, with an optional fourth key for margin of error inputs: | |
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
|
|
||
| * - key | ||
| - value | ||
| * - min | ||
| - The minimum value of the range | ||
| - The minimum value of the range (if unknown use `math.nan`) | ||
| * - max | ||
| - The maximum value of the range | ||
| - The maximum value of the range (if unknown use `math.nan`) | ||
| * - n | ||
| - The number of people, households or other units in the range | ||
| * - moe (optional) | ||
| - The `n` value's associated margin of error. If given as an input, a simulation approach will be used to estimate the new margin of error. | ||
|
|
||
|
|
||
|
|
||
| .. code-block:: python | ||
|
|
||
| >>> household_income_la_2013_acs1 = [ | ||
| dict(min=2499, max=9999, n=1382), | ||
| dict(min=10000, max=14999, n=2377), | ||
| dict(min=15000, max=19999, n=1332), | ||
| dict(min=20000, max=24999, n=3129), | ||
| dict(min=25000, max=29999, n=1927), | ||
| dict(min=30000, max=34999, n=1825), | ||
| dict(min=35000, max=39999, n=1567), | ||
| dict(min=40000, max=44999, n=1996), | ||
| dict(min=45000, max=49999, n=1757), | ||
| dict(min=50000, max=59999, n=3523), | ||
| dict(min=60000, max=74999, n=4360), | ||
| dict(min=75000, max=99999, n=6424), | ||
| dict(min=100000, max=124999, n=5257), | ||
| dict(min=125000, max=149999, n=3485), | ||
| dict(min=150000, max=199999, n=2926), | ||
| dict(min=200000, max=250001, n=4215) | ||
| ] | ||
|
|
||
| For a margin of error to be returned, a sampling percentage must be provided to calculate the standard error. The sampling percentage represents what proportion of the population that participated in the survey. Here are the values for some common census surveys. | ||
| >>> median_with_moe_example = [ | ||
| dict(min=None, max=9999, n=6, moe=1), | ||
| dict(min=10000, max=14999, n=1, moe=1), | ||
| dict(min=15000, max=19999, n=8, moe=1), | ||
| dict(min=20000, max=24999, n=7, moe=1), | ||
| dict(min=25000, max=29999, n=2, moe=1), | ||
| dict(min=30000, max=34999, n=900, moe=8), | ||
| dict(min=35000, max=39999, n=7, moe=1), | ||
| dict(min=40000, max=44999, n=4, moe=1), | ||
| dict(min=45000, max=49999, n=8, moe=1), | ||
| dict(min=50000, max=59999, n=6, moe=1), | ||
| dict(min=60000, max=74999, n=7, moe=1), | ||
| dict(min=75000, max=99999, n=2, moe=0.25), | ||
| dict(min=100000, max=124999, n=7, moe=1), | ||
| dict(min=125000, max=149999, n=10, moe=1), | ||
| dict(min=150000, max=199999, n=8, moe=1), | ||
| dict(min=200000, max=None, n=18, moe=10) | ||
| ] | ||
|
|
||
|
|
||
| .. code-block:: python | ||
|
|
||
| >>> census_data_aggregator.approximate_median(median_with_moe_example, sampling_percentage=2.5) | ||
| (32646.07020990552, 26.638686513280845) | ||
|
|
||
| In the case without margin of error inputs, a sampling percentage must be provided in order for a margin of error to be returned. The sampling percentage represents the proportion of the population that participated in the survey. Here are the values for some common census surveys. | |
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
|
|
@@ -183,20 +192,90 @@ For a margin of error to be returned, a sampling percentage must be provided to | |
| * - Five-year ACS | ||
| - 12.5 | ||
|
|
||
| .. code-block:: python | ||
| If you do not provide the sampling percentage value to the function, no margin of error will be returned. | ||
|
|
||
| >>> census_data_aggregator.approximate_median(household_income_Los_Angeles_County_2013_acs1, sampling_percentage=2.5) | ||
| 70065.84266055046, 3850.680465234964 | ||
| .. code-block:: python | ||
|
|
||
| If you do not provide the value to the function, no margin of error will be returned. | ||
| >>> median_without_moe_example = [ | ||
| dict(min=None, max=9999, n=6), | ||
| dict(min=10000, max=14999, n=1), | ||
| dict(min=15000, max=19999, n=8), | ||
| dict(min=20000, max=24999, n=7), | ||
| dict(min=25000, max=29999, n=2), | ||
| dict(min=30000, max=34999, n=900), | ||
| dict(min=35000, max=39999, n=7), | ||
| dict(min=40000, max=44999, n=4), | ||
| dict(min=45000, max=49999, n=8), | ||
| dict(min=50000, max=59999, n=6), | ||
| dict(min=60000, max=74999, n=7), | ||
| dict(min=75000, max=99999, n=2), | ||
| dict(min=100000, max=124999, n=7), | ||
| dict(min=125000, max=149999, n=10), | ||
| dict(min=150000, max=199999, n=8), | ||
| dict(min=200000, max=None, n=18) | ||
| ] | ||
|
|
||
| >>> census_data_aggregator.approximate_median(median_without_moe_example) | ||
| 32646.69277777778, None | ||
|
|
||
| If the data being approximated comes from PUMS, an additional design factor must also be provided. | ||
| The design factor is a statistical input used to tailor the estimate to the variance of the dataset. | ||
| Find the value for the dataset you are estimating by referring to `the bureau's reference material <https://www.census.gov/programs-surveys/acs/technical-documentation/pums/documentation.html>`_. | ||
|
|
||
| If you have associated "jam values" for your dataset, as provided in the `American Community Survey's technical documentation <https://www.documentcloud.org/documents/6165752-2017-SummaryFile-Tech-Doc.html#document/p20/a508561>`_, input the pair as a list to the `jam_values` keyword argument. | |
| Then if the median falls in the first or last bin, the jam value will be returned instead of `None`. | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| >>> census_data_aggregator.approximate_median(household_income_Los_Angeles_County_2013_acs1) | ||
| 70065.84266055046, None | ||
| >>> jam_without_simulation = [ | ||
| dict(min=None, max=9999, n=6), | ||
| dict(min=10000, max=14999, n=1), | ||
| dict(min=15000, max=19999, n=8), | ||
| dict(min=20000, max=24999, n=7), | ||
| dict(min=25000, max=29999, n=2), | ||
| dict(min=30000, max=34999, n=9), | ||
| dict(min=35000, max=39999, n=7), | ||
| dict(min=40000, max=44999, n=4), | ||
| dict(min=45000, max=49999, n=8), | ||
| dict(min=50000, max=59999, n=6), | ||
| dict(min=60000, max=74999, n=7), | ||
| dict(min=75000, max=99999, n=2), | ||
| dict(min=100000, max=124999, n=7), | ||
| dict(min=125000, max=149999, n=10), | ||
| dict(min=150000, max=199999, n=8), | ||
| dict(min=200000, max=None, n=186) | ||
| ] | ||
| >>> import numpy | ||
| >>> census_data_aggregator.approximate_median(jam_without_simulation, sampling_percentage=5*2.5,jam_values=[2599, 200001]) | ||
| (200001, None) | ||
|
|
||
| If the `n` values have an associated margin of error, a simulation based approach will be used to estimate the new margin of error. The `simulations` keyword argument controls the number of simulations to run and defaults to 50. | ||
| Jam values will not be used in the simulation approach. If the estimated median falls in the lower or upper bin, the estimate returned will be `None`. | ||
|
|
||
|
|
||
| If the data being approximated comes from PUMS, an additional design factor must also be provided. The design factor is a statistical input used to tailor the estimate to the variance of the dataset. Find the value for the dataset you are estimating by referring to `the bureau's reference material <https://www.census.gov/programs-surveys/acs/technical-documentation/pums/documentation.html>`_. | ||
| .. code-block:: python | ||
|
Collaborator
If we believe that the moe-based method is superior, we should make it the first and default example.
||
|
|
||
| >>> simulation_with_jam = [ | ||
| dict(min=None, max=9999, n=6, moe=1), | ||
| dict(min=10000, max=14999, n=1, moe=1), | ||
| dict(min=15000, max=19999, n=8, moe=1), | ||
| dict(min=20000, max=24999, n=7, moe=1), | ||
| dict(min=25000, max=29999, n=2, moe=1), | ||
| dict(min=30000, max=34999, n=90, moe=8), | ||
| dict(min=35000, max=39999, n=7, moe=1), | ||
| dict(min=40000, max=44999, n=4, moe=1), | ||
| dict(min=45000, max=49999, n=8, moe=1), | ||
| dict(min=50000, max=59999, n=6, moe=1), | ||
| dict(min=60000, max=74999, n=7, moe=1), | ||
| dict(min=75000, max=99999, n=2, moe=0.25), | ||
| dict(min=100000, max=124999, n=7, moe=1), | ||
| dict(min=125000, max=149999, n=10, moe=1), | ||
| dict(min=150000, max=199999, n=8, moe=1), | ||
| dict(min=200000, max=None, n=186, moe=10) | ||
| ] | ||
| >>> import numpy | ||
| >>> census_data_aggregator.approximate_median(simulation_with_jam, simulations=50, jam_values=[2499, 200001]) | ||
| (None, None) | ||
|
|
||
| Approximating percent change | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
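For readers following the non-simulation examples in this diff, here is a minimal sketch of how a median can be linearly interpolated from binned counts. It is not taken from the `census_data_aggregator` source: `interpolated_median` is a hypothetical helper, and treating the bin width as `max - min` is an assumption, chosen because it is consistent with the `median_without_moe_example` output shown in the diff.

```python
# Illustrative sketch only; not the census_data_aggregator implementation.


def interpolated_median(bins):
    """Estimate a median from binned counts by linear interpolation.

    `bins` is an ordered list of dicts with `min`, `max` and `n` keys.
    Returns None when the median falls in a bin with an unknown bound.
    """
    total = sum(b["n"] for b in bins)
    if total == 0:
        return None
    midpoint = total / 2.0
    cumulative = 0
    for b in bins:
        if cumulative + b["n"] >= midpoint:
            if b["min"] is None or b["max"] is None:
                # Open-ended bin: nothing to interpolate between.
                return None
            # Share of this bin's units that sit below the midpoint.
            share = (midpoint - cumulative) / b["n"]
            # Assumption: bin width is taken as max - min.
            return b["min"] + share * (b["max"] - b["min"])
        cumulative += b["n"]
    return None


# Same counts as median_without_moe_example in the diff.
rows = [
    (None, 9999, 6), (10000, 14999, 1), (15000, 19999, 8),
    (20000, 24999, 7), (25000, 29999, 2), (30000, 34999, 900),
    (35000, 39999, 7), (40000, 44999, 4), (45000, 49999, 8),
    (50000, 59999, 6), (60000, 74999, 7), (75000, 99999, 2),
    (100000, 124999, 7), (125000, 149999, 10), (150000, 199999, 8),
    (200000, None, 18),
]
bins = [dict(min=lo, max=hi, n=n) for lo, hi, n in rows]
estimate = interpolated_median(bins)
print(estimate)
```

Under these assumptions the estimate is approximately 32646.69, in line with the value documented in the diff. The sketch returns `None` for an open-ended bin rather than a jam value, and it does not attempt a margin of error.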
Add a jam values example here that does not use the simulation method.
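As a rough illustration of the non-simulation jam value behaviour described in the README text, the sketch below finds which bin holds the median and falls back to a jam value when that bin is open-ended. `median_bin_index` and `median_or_jam` are hypothetical names, not functions from `census_data_aggregator`. The counts and jam values are copied from the `jam_without_simulation` example in the diff.

```python
# Illustrative sketch only; not the census_data_aggregator implementation.


def median_bin_index(counts):
    """Return the index of the bin containing the midpoint of the counts."""
    midpoint = sum(counts) / 2.0
    cumulative = 0
    for index, n in enumerate(counts):
        cumulative += n
        if cumulative >= midpoint:
            return index
    return None


def median_or_jam(bins, jam_values):
    """Return a jam value if the median sits in an open-ended edge bin.

    `jam_values` is a (lower, upper) pair. Interior bins return None here
    because interpolation is outside the scope of this sketch.
    """
    lower_jam, upper_jam = jam_values
    index = median_bin_index([b["n"] for b in bins])
    if index is None:
        return None
    if index == 0 and bins[0]["min"] is None:
        return lower_jam
    if index == len(bins) - 1 and bins[-1]["max"] is None:
        return upper_jam
    return None


# Counts from the jam_without_simulation example in the diff.
counts = [6, 1, 8, 7, 2, 9, 7, 4, 8, 6, 7, 2, 7, 10, 8, 186]
lower_bounds = [None, 10000, 15000, 20000, 25000, 30000, 35000, 40000,
                45000, 50000, 60000, 75000, 100000, 125000, 150000, 200000]
upper_bounds = [9999, 14999, 19999, 24999, 29999, 34999, 39999, 44999,
                49999, 59999, 74999, 99999, 124999, 149999, 199999, None]
bins = [dict(min=lo, max=hi, n=n)
        for lo, hi, n in zip(lower_bounds, upper_bounds, counts)]

result = median_or_jam(bins, jam_values=[2599, 200001])
print(result)
```

With 186 of the 278 units in the top bin, the midpoint falls in the last, open-ended bin, so the upper jam value `200001` is returned, matching the estimate shown for `jam_without_simulation`.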