Creation Demo Notebooks
Using Engines and Degraders to Generate Galaxy Samples with Errors and Biases¶
author: John Franklin Crenshaw, Sam Schmidt, Eric Charles, others...
last run successfully: March 16, 2022
This notebook demonstrates how to use a RAIL Engine to create galaxy samples, and how to use Degraders to add various errors and biases to the sample.
Note that in the parlance of the Creation Module, "degradation" is any post-processing that occurs to the "true" sample generated by the Engine. This can include adding photometric errors, applying quality cuts, introducing systematic biases, etc.
In this notebook, we will first learn how to draw samples from a RAIL Engine object. Then we will demonstrate how to use the following RAIL Degraders:
- LSSTErrorModel, which adds photometric errors
- QuantityCut, which applies cuts to the specified columns of the sample
- InvRedshiftIncompleteness, which introduces sample incompleteness
- LineConfusion, which introduces spectroscopic errors
Throughout the notebook, we will show how you can chain all these Degraders together to build a more complicated degrader. Hopefully, this will allow you to see how you can build your own degrader.
Note on generating redshift posteriors: regardless of what Degraders you apply, when you use a Creator to estimate posteriors, the posteriors will always be calculated with respect to the "true" distribution. This is the whole point of the Creation Module -- you can generate degraded samples for which we still have access to the true posteriors. For an example of how to calculate posteriors, see posterior-demo.ipynb.
import matplotlib.pyplot as plt
from pzflow.examples import get_example_flow
from rail.creation.engines.flowEngine import FlowCreator
from rail.creation.degradation import (
InvRedshiftIncompleteness,
LineConfusion,
LSSTErrorModel,
QuantityCut,
)
from rail.core.stage import RailStage
Specify the path to the pretrained 'pzflow' used to generate samples¶
import pzflow
import os
flow_file = os.path.join(os.path.dirname(pzflow.__file__), 'example_files', 'example-flow.pzflow.pkl')
We'll start by setting up the Rail data store. RAIL uses ceci, which is designed for pipelines rather than interactive notebooks; the data store works around that and enables us to use data interactively. See the rail/examples/goldenspike/goldenspike.ipynb
example notebook for more details on the Data Store.
DS = RailStage.data_store
DS.__class__.allow_overwrite = True
"True" Engine¶
First, let's make an Engine that has no degradation. We can use it to generate a "true" sample, to which we can compare all the degraded samples below.
Note: in this example, we will use a normalizing flow engine from the pzflow package. However, everything in this notebook is totally agnostic to what the underlying engine is.
The Engine is a type of RailStage object, so we can make one using the RailStage.make_stage function for the class of Engine that we want. We then pass in the configuration parameters as arguments to make_stage.
n_samples = int(1e5)
flowCreator_truth = FlowCreator.make_stage(name='truth', model=flow_file, n_samples=n_samples)
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
Inserting handle into data store. model: /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pzflow/example_files/example-flow.pzflow.pkl, truth
Let's check that the Engine correctly read the underlying pzflow Flow object¶
flowCreator_truth.get_data('model')
<pzflow.flow.Flow at 0x7f1b0495e6a0>
Now we invoke the sample method to generate some samples¶
Note that this will return a DataHandle object, which keeps both the data itself and the path to where the data is written. When talking to RAIL stages we can pass it as though it were the underlying data; this allows the stages to keep track of where their inputs are coming from.
samples_truth = flowCreator_truth.sample(n_samples, seed=0)
print(samples_truth())
print("Data was written to ", samples_truth.path)
Inserting handle into data store. output_truth: inprogress_output_truth.pq, truth redshift u g r i z \ 0 0.890625 27.370831 26.712662 26.025223 25.327188 25.016500 1 1.978239 29.557049 28.361185 27.587231 27.238544 26.628109 2 0.974287 26.566015 25.937716 24.787413 23.872456 23.139563 3 1.317979 29.042730 28.274593 27.501106 26.648790 26.091450 4 1.386366 26.292624 25.774778 25.429958 24.806530 24.367950 ... ... ... ... ... ... ... 99995 2.147172 26.550978 26.349937 26.135286 26.082020 25.911032 99996 1.457508 27.362207 27.036276 26.823139 26.420132 26.110037 99997 1.372992 27.736044 27.271955 26.887581 26.416138 26.043434 99998 0.855022 28.044552 27.327116 26.599014 25.862331 25.592169 99999 1.723768 27.049067 26.526745 26.094595 25.642971 25.197956 y 0 24.926821 1 26.248560 2 22.832047 3 25.346500 4 23.700010 ... ... 99995 25.558136 99996 25.524904 99997 25.456165 99998 25.506388 99999 24.900501 [100000 rows x 7 columns] Data was written to output_truth.pq
Degrader 1: LSSTErrorModel¶
Now, we will demonstrate the LSSTErrorModel, which adds photometric errors using a model similar to that of Ivezic et al. 2019. Specifically, it uses the model from that paper without making the high-SNR assumption; to restore that assumption and use the exact model from the paper, set highSNR=True.
Let's create an error model with the default settings:
errorModel = LSSTErrorModel.make_stage(name='error_model')
To see the details of the model, including the default settings we are using, you can just print the model:
errorModel
LSSTErrorModel parameters: Model for bands: u, g, r, i, z, y Using error type point Exposure time = 30.0 s Number of years of observations = 10.0 Mean visits per year per band: u: 5.6, g: 8.0, r: 18.4, i: 18.4, z: 16.0, y: 16.0 Airmass = 1.2 Irreducible system error = 0.005 Magnitudes dimmer than 30.0 are set to nan gamma for each band: u: 0.038, g: 0.039, r: 0.039, i: 0.039, z: 0.039, y: 0.039 The coadded 5-sigma limiting magnitudes are: u: 26.04, g: 27.29, r: 27.31, i: 26.87, z: 26.23, y: 25.30 The following single-visit 5-sigma limiting magnitudes are calculated using the parameters that follow them: u: 23.83, g: 24.90, r: 24.47, i: 24.03, z: 23.46, y: 22.53 Cm for each band: u: 23.09, g: 24.42, r: 24.44, i: 24.32, z: 24.16, y: 23.73 Median zenith sky brightness in each band: u: 22.99, g: 22.26, r: 21.2, i: 20.48, z: 19.6, y: 18.61 Median zenith seeing FWHM (in arcseconds) for each band: u: 0.81, g: 0.77, r: 0.73, i: 0.71, z: 0.69, y: 0.68 Extinction coefficient for each band: u: 0.491, g: 0.213, r: 0.126, i: 0.096, z: 0.069, y: 0.17
Now let's add this error model as a degrader and draw some samples with photometric errors.
samples_w_errs = errorModel(samples_truth)
samples_w_errs()
Inserting handle into data store. output_error_model: inprogress_output_error_model.pq, error_model
redshift | u | u_err | g | g_err | r | r_err | i | i_err | z | z_err | y | y_err | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.890625 | NaN | NaN | 26.562721 | 0.105583 | 26.084861 | 0.068194 | 25.340978 | 0.052257 | 25.021891 | 0.069445 | 25.047443 | 0.159796 |
1 | 1.978239 | NaN | NaN | 28.038419 | 0.362520 | 27.490722 | 0.229680 | 28.102581 | 0.525461 | 26.066428 | 0.172483 | 25.834953 | 0.307316 |
2 | 0.974287 | 26.873697 | 0.389236 | 25.882633 | 0.057988 | 24.797719 | 0.021944 | 23.873355 | 0.014716 | 23.128763 | 0.013557 | 22.861474 | 0.023448 |
3 | 1.317979 | 27.914048 | 0.817339 | 27.705399 | 0.277971 | 27.204204 | 0.180633 | 26.703293 | 0.172092 | 25.931166 | 0.153677 | 25.795159 | 0.297649 |
4 | 1.386366 | 26.336934 | 0.253759 | 25.750773 | 0.051593 | 25.483414 | 0.039993 | 24.809233 | 0.032626 | 24.301733 | 0.036670 | 23.576059 | 0.043921 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
99995 | 2.147172 | 26.643909 | 0.325091 | 26.212954 | 0.077661 | 26.220695 | 0.076900 | 26.027656 | 0.095907 | 26.102146 | 0.177794 | 25.635739 | 0.261534 |
99996 | 1.457508 | 26.621966 | 0.319467 | 26.982388 | 0.151845 | 26.542811 | 0.102093 | 26.446734 | 0.138137 | 25.959232 | 0.157414 | 25.461991 | 0.226646 |
99997 | 1.372992 | 26.679523 | 0.334399 | 27.416936 | 0.219265 | 27.042587 | 0.157411 | 26.480484 | 0.142215 | 26.165722 | 0.187622 | 24.902178 | 0.141068 |
99998 | 0.855022 | 26.886674 | 0.393155 | 27.355825 | 0.208363 | 26.494891 | 0.097896 | 25.783669 | 0.077364 | 25.514723 | 0.107157 | 25.333237 | 0.203557 |
99999 | 1.723768 | 27.557109 | 0.643300 | 26.442709 | 0.095055 | 26.216528 | 0.076618 | 25.710465 | 0.072517 | 25.169914 | 0.079153 | 24.799610 | 0.129108 |
100000 rows × 13 columns
Notice that some of the magnitudes are NaN. These are non-detections: those observed magnitudes were beyond the default 30 mag limit in LSSTErrorModel.
You can change this limit and the corresponding flag by setting magLim=... and ndFlag=... in the constructor for LSSTErrorModel.
Let's plot the error as a function of magnitude:
fig, ax = plt.subplots(figsize=(5, 4), dpi=100)
for band in "ugrizy":
# pull out the magnitudes and errors
mags = samples_w_errs.data[band].to_numpy()
errs = samples_w_errs.data[band + "_err"].to_numpy()
# sort them by magnitude
mags, errs = mags[mags.argsort()], errs[mags.argsort()]
# plot errs vs mags
ax.plot(mags, errs, label=band)
ax.legend()
ax.set(xlabel="Magnitude (AB)", ylabel="Error (mags)")
plt.show()
You can see that the photometric error increases as magnitude gets dimmer, just like you would expect. Notice, however, that we have galaxies as dim as magnitude 30. This is because the Flow produces a sample much deeper than the LSST 5-sigma limiting magnitudes. There are no galaxies dimmer than magnitude 30 because LSSTErrorModel sets magnitudes > 30 equal to NaN (the default flag for non-detections).
Degrader 2: QuantityCut¶
Recall how the sample above has galaxies as dim as magnitude 30. This is well beyond the LSST 5-sigma limiting magnitudes, so it will be useful to apply cuts to the data to filter out these super-dim samples. We can apply these cuts using the QuantityCut degrader, which will cut out any samples that do not pass all of the specified cuts.
Let's make a degrader that cuts at i < 25.3, which matches the LSST gold sample, and apply it to the sample that already has photometric errors.
gold_cut = QuantityCut.make_stage(name='cuts', cuts={"i": 25.3})
Now we can run this degrader on the sample with photometric errors:
samples_gold_w_errs = gold_cut(samples_w_errs)
Inserting handle into data store. output_cuts: inprogress_output_cuts.pq, cuts
If you look at the i column, you will see there are no longer any samples with i > 25.3. Unlike the LSSTErrorModel degrader, this degrader removes galaxies: the number returned has been nearly cut in half relative to the input. In general, for any degrader that removes galaxies, the size of the output sample will not equal that of the input sample.
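Conceptually, on a pandas DataFrame the gold cut is just a row filter. A sketch of the equivalent operation (plain pandas on a made-up mini-catalog, not the actual QuantityCut code):

```python
import pandas as pd

# Hypothetical mini-catalog standing in for samples_w_errs
df = pd.DataFrame({"i": [24.8, 25.2, 25.9, 26.4],
                   "redshift": [0.5, 1.1, 1.6, 2.0]})

# QuantityCut({"i": 25.3}) keeps only rows passing ALL specified cuts;
# here that is just i < 25.3. NaN magnitudes (non-detections) fail the
# comparison and are dropped as well.
gold = df[df["i"] < 25.3]
print(len(gold))  # 2
```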
One more note: it is easy to use the QuantityCut degrader as a SNR cut on the magnitudes. The magnitude equation is $m = -2.5 \log(f)$. Taking the derivative, we have
$$
dm = \frac{2.5}{\ln(10)} \frac{df}{f} = \frac{2.5}{\ln(10)} \frac{1}{\mathrm{SNR}}.
$$
So if you want to make a cut on galaxies above a certain SNR, you can make a cut
$$
dm < \frac{2.5}{\ln(10)} \frac{1}{\mathrm{SNR}}.
$$
For example, an SNR cut on the i band would look like this: QuantityCut({"i_err": 2.5/np.log(10) * 1/SNR}).
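To make the relation concrete, here is the numerical conversion for a few common thresholds. The helper name `snr_to_mag_err` is ours, just for illustration; the constant is 2.5/ln 10 ≈ 1.086.

```python
import numpy as np

def snr_to_mag_err(snr):
    """Magnitude-error threshold corresponding to a given SNR cut."""
    return 2.5 / np.log(10) / snr

for snr in (5, 10, 20):
    print(f"SNR > {snr:2d}  <=>  mag err < {snr_to_mag_err(snr):.4f}")
# e.g. an SNR > 10 cut corresponds to a magnitude-error cut of about 0.109
```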
Degrader 3: InvRedshiftIncompleteness¶
Next, we will demonstrate the InvRedshiftIncompleteness degrader. It applies a selection function which keeps galaxies with probability $p_{\text{keep}}(z) = \min(1, \frac{z_p}{z})$, where $z_p$ is the "pivot" redshift. We'll use $z_p = 0.8$.
inv_incomplete = InvRedshiftIncompleteness.make_stage(name='incompleteness', pivot_redshift=0.8)
samples_incomplete_gold_w_errs = inv_incomplete(samples_gold_w_errs)
Inserting handle into data store. output_incompleteness: inprogress_output_incompleteness.pq, incompleteness
Let's plot the redshift distributions of the samples we have generated so far:
fig, ax = plt.subplots(figsize=(5, 4), dpi=100)
zmin = 0
zmax = 2.5
hist_settings = {
"bins": 50,
"range": (zmin, zmax),
"density": True,
"histtype": "step",
}
ax.hist(samples_truth()["redshift"], label="Truth", **hist_settings)
ax.hist(samples_gold_w_errs()["redshift"], label="Gold", **hist_settings)
ax.hist(samples_incomplete_gold_w_errs()["redshift"], label="Incomplete Gold", **hist_settings)
ax.legend(title="Sample")
ax.set(xlim=(zmin, zmax), xlabel="Redshift", ylabel="Galaxy density")
plt.show()
You can see that the Gold sample has significantly fewer high-redshift galaxies than the truth. This is because many of the high-redshift galaxies have i > 25.3.
You can further see that the Incomplete Gold sample has even fewer high-redshift galaxies. This is exactly what we expected from this degrader.
Degrader 4: LineConfusion¶
LineConfusion is a degrader that simulates spectroscopic errors resulting from the confusion of different emission lines.
For this example, let's use the degrader to simulate a scenario in which 2% of [OII] lines are mistaken as [OIII] lines, and 1% of [OIII] lines are mistaken as [OII] lines. (Note that I do not know how realistic this scenario is!)
OII = 3727
OIII = 5007
lc_2p_0II_0III = LineConfusion.make_stage(name='lc_2p_0II_0III',
true_wavelen=OII, wrong_wavelen=OIII, frac_wrong=0.02)
lc_1p_0III_0II = LineConfusion.make_stage(name='lc_1p_0III_0II',
true_wavelen=OIII, wrong_wavelen=OII, frac_wrong=0.01)
samples_conf_inc_gold_w_errs = lc_1p_0III_0II(lc_2p_0II_0III(samples_incomplete_gold_w_errs))
Inserting handle into data store. output_lc_2p_0II_0III: inprogress_output_lc_2p_0II_0III.pq, lc_2p_0II_0III Inserting handle into data store. output_lc_1p_0III_0II: inprogress_output_lc_1p_0III_0II.pq, lc_1p_0III_0II
Let's plot the redshift distributions one more time
fig, ax = plt.subplots(figsize=(5, 4), dpi=100)
zmin = 0
zmax = 2.5
hist_settings = {
"bins": 50,
"range": (zmin, zmax),
"density": True,
"histtype": "step",
}
ax.hist(samples_truth()["redshift"], label="Truth", **hist_settings)
ax.hist(samples_gold_w_errs()["redshift"], label="Gold", **hist_settings)
ax.hist(samples_incomplete_gold_w_errs()["redshift"], label="Incomplete Gold", **hist_settings)
ax.hist(samples_conf_inc_gold_w_errs()["redshift"], label="Confused Incomplete Gold", **hist_settings)
ax.legend(title="Sample")
ax.set(xlim=(zmin, zmax), xlabel="Redshift", ylabel="Galaxy density")
plt.show()
You can see that the redshift distribution of this new sample is essentially identical to the Incomplete Gold sample, with small perturbations that result from the line confusion.
However the real impact of this degrader isn't on the redshift distribution, but rather that it introduces erroneous spec-z's into the photo-z training sets! To see the impact of this effect, let's plot the true spec-z's as present in the Incomplete Gold sample, vs the spec-z's listed in the new sample with Oxygen Line Confusion.
fig, ax = plt.subplots(figsize=(6, 6), dpi=85)
ax.scatter(samples_incomplete_gold_w_errs()["redshift"], samples_conf_inc_gold_w_errs()["redshift"],
marker=".", s=1)
ax.set(
xlim=(0, 2.5), ylim=(0, 2.5),
xlabel="True spec-z (in Incomplete Gold sample)",
ylabel="Spec-z listed in the Confused sample",
)
plt.show()
Now we can clearly see the spec-z errors! The galaxies above the line y=x are the [OII] -> [OIII] galaxies, while the ones below are the [OIII] -> [OII] galaxies.
example_GridSelection_for_HSC.ipynb
GridSelection degrader to emulate HSC training samples¶
The GridSelection degrader can be used to model the spectroscopic success rates in training sets based on real data. Given a 2-dimensional grid of spec-z success ratio as a function of two variables (often magnitude or color), the degrader will draw the appropriate fraction of samples from the input data and return a sample with incompleteness modeled. An additional redshift cut can also be applied, where all redshifts above the cutoff are also removed from the sample.
The degrader takes the following arguments:
- ratio_file: the name of the file containing the 2-dimensional grid of spec-z success
- random_seed: random seed to feed to numpy for reproducibility
- settings_file: path to the pickled file containing settings that define the 2-dimensional grid. There is a mechanism to make cuts either on a single column from the input data, or on a difference (i.e. either a magnitude or a color as a difference of two magnitudes). The parameters in the settings file are:
  - x_band_1: column name for the x-axis variable from the ratios grid
  - x_band_2: [optional] column name for the second x-axis variable. If x_band_2 is set to '' then the x-axis is parameterized in terms of x_band_1 alone; if x_band_2 is not '' then the x-axis is compared against (x_band_1 - x_band_2)
  - y_band_1 and y_band_2: analogous to x_band_1 and x_band_2, but for the y-axis
  - x_limits and y_limits: 2-element lists with the minimum and maximum values for the grid, e.g. [13, 26] if the limits in the x-axis are between magnitudes of 13 and 26
In this quick notebook we'll create a grid of mock galaxies on the same grid on which the Hyper Suprime-Cam (HSC) survey spectroscopic success has been parameterized by Irene Moskowitz (available in RAIL/examples/creation/data), and plot the success rate to visualize the spectroscopic success rate for HSC.
import rail
import os
import matplotlib.pyplot as plt
import numpy as np
import tables_io
import pandas as pd
#from rail.core.data import TableHandle
from rail.core.stage import RailStage
%matplotlib inline
DS = RailStage.data_store
DS.__class__.allow_overwrite = True
Let's make a grid of fake data matching the grid used by HSC. The 2D grid of spec-z success in this case is parameterized in terms of g-z color vs i-band magnitude, with g-z between [-2, 6] and i-band magnitude spanning [13, 26]. Let's generate 100 identical objects in each of the 100x100=10,000 grid cells, for a total of 1,000,000 mock galaxies. The only quantities that we need are g, i, z magnitudes and a redshift that we can just set to a random number between 0 and 2.5. The only thing that we really need is consistent g-z and i-mag values, so we can just set g to 20.0 in all circumstances.
gridgz, gridi = np.meshgrid(np.linspace(-1.98, 5.98, 100), np.linspace(13.0325, 25.9675, 100))
i = gridi.flatten()
gz = gridgz.flatten()
g = np.full_like(i, 20.0, dtype=np.double)
z = g - gz
redshift = np.round(np.random.uniform(size=len(i))*2.5, 2)
mockdict = {}
for label, item in zip(['i', 'gz', 'g', 'z', 'redshift'], [i, gz, g, z, redshift]):
mockdict[f'{label}'] = np.repeat(item, 100).flatten()
df = pd.DataFrame(mockdict)
df.head()
i | gz | g | z | redshift | |
---|---|---|---|---|---|
0 | 13.0325 | -1.98 | 20.0 | 21.98 | 1.13 |
1 | 13.0325 | -1.98 | 20.0 | 21.98 | 1.13 |
2 | 13.0325 | -1.98 | 20.0 | 21.98 | 1.13 |
3 | 13.0325 | -1.98 | 20.0 | 21.98 | 1.13 |
4 | 13.0325 | -1.98 | 20.0 | 21.98 | 1.13 |
Now, let's import the GridSelection degrader and set up the config dict parameters. We will set a redshift cut of 5.1 so as not to cut any of our mock galaxies; if you wanted a redshift cut, you would simply change this parameter as desired. There is an optional color_redshift_cut that scales the number of galaxies kept; we will turn this off. There is also a percentile_cut that computes redshift percentiles in each cell and removes the highest-redshift galaxies, as those are usually the least likely to be recovered by a spectroscopic survey. For simplicity, we will set percentile_cut to 100 so that this cut removes no galaxies.
The ratio file for HSC is located in the RAIL/examples/creation/data directory; as we are running this demo from the RAIL/examples/creation folder, the paths for ratio_file and settings_file are set accordingly.
We will set a random seed for reproducibility, and set the output file to write our incomplete catalog to "test_hsc.pq".
from rail.creation.degradation.grid_selection import GridSelection
configdict = dict(redshift_cut=5.1, ratio_file='../../src/rail/examples/creation/data/hsc_ratios_and_specz.hdf5',
settings_file='../../src/rail/examples/creation/data/HSC_grid_settings.pkl', percentile_cut=100.,
color_redshift_cut=False,
output="test_hsc.pq", random_seed=66)
hsc_selecter = GridSelection.make_stage(name='hsc_cutter', **configdict)
Let's run the code and see how long it takes:
%%time
trim_data = hsc_selecter(df)
Inserting handle into data store. input: None, hsc_cutter Inserting handle into data store. output_hsc_cutter: inprogress_test_hsc.pq, hsc_cutter CPU times: user 4.51 s, sys: 357 ms, total: 4.86 s Wall time: 4.87 s
This took about 5 seconds in the run above, not too bad for 1,000,000 mock galaxies.
trim_data().info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 181751 entries, 343119 to 924109 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 i 181751 non-null float64 1 gz 181751 non-null float64 2 g 181751 non-null float64 3 z 181751 non-null float64 4 redshift 181751 non-null float64 5 max_specz 181751 non-null float64 dtypes: float64(6) memory usage: 9.7 MB
And we see that we've kept 181,751 out of the 1,000,000 galaxies in the initial sample, so about 18% of the initial sample. To visualize our cuts, let's read in the success ratios file and plot our sample overlaid with a low alpha, so that the strength of the black dots gives a visual indication of how many galaxies in each cell we've kept.
# compare to sum of ratios * 100
ratio_file='../../src/rail/examples/creation/data/hsc_ratios_and_specz.hdf5'
allrats = tables_io.read(ratio_file)['ratios']
trim_data()['color'] = trim_data()['g'] - trim_data()['z']
bgr, bi = np.meshgrid(np.linspace(-2, 6, 101), np.linspace(13, 26, 101))
bratios = tables_io.read("../../src/rail/examples/creation/data/hsc_ratios_and_specz.hdf5")['ratios']
plt.figure(figsize=(18,18))
plt.pcolormesh(bi, bgr, bratios.T, cmap='jet', vmin=0, vmax=1,alpha=0.4)
plt.scatter(trim_data()['i'], trim_data()['color'], s=3, c='k',alpha =.015)
plt.xlabel("i-band mag", fontsize=18)
plt.ylabel("g-z", fontsize=18)
Text(0, 0.5, 'g-z')
The colormap shows the HSC ratios and the strength of the black dots shows how many galaxies were actually kept. We see excellent agreement between our predicted ratios and the actual number of galaxies kept, so the degrader is functioning properly, and we get a nice visual representation of the resulting spectroscopic sample incompleteness.
# as our demo data is just a grid of mock points, let's remove it so we do not leave a large file sitting around
os.remove("./test_hsc.pq")
example_SpecSelection_for_zCOSMOS.ipynb
Spectroscopic selection degrader to emulate zCOSMOS training samples¶
The spectroscopic_selection degrader can be used to model the spectroscopic success rates in training sets based on real data. Given a 2-dimensional grid of spec-z success ratio as a function of two variables (often magnitude, color, or redshift), the degrader will draw the appropriate fraction of samples from the input data and return a sample with incompleteness modeled.
The degrader takes the following arguments:
- N_tot: number of selected sources
- nondetect_val: non-detected magnitude value to be excluded (usually 99.0, -99.0 or NaN)
- downsample: if true, downsample the selected sources to a total number of N_tot
- success_rate_dir: the path to the directory containing success rate files
- colnames: a dictionary of the columns (magnitudes, colors and redshift) needed for selection. For magnitudes, the keys are ugrizy; for colors, the keys are, for example, gr standing for g-r; for redshift, the key is 'redshift'. In this demo, zCOSMOS takes {'i':'i', 'redshift':'redshift'} as the minimum necessary input
In this quick notebook we'll select galaxies based on the zCOSMOS selection function.
import rail
import os
import matplotlib.pyplot as plt
import numpy as np
import tables_io
import pandas as pd
#from rail.core.data import TableHandle
from rail.core.stage import RailStage
%matplotlib inline
DS = RailStage.data_store
DS.__class__.allow_overwrite = True
Let's make fake data for zCOSMOS selection.
i = np.random.uniform(low=18, high=25.9675, size=(2000000,))
gz = np.random.uniform(low=-1.98, high=5.98, size=(2000000,))
u = np.full_like(i, 20.0, dtype=np.double)
g = np.full_like(i, 20.0, dtype=np.double)
r = np.full_like(i, 20.0, dtype=np.double)
y = np.full_like(i, 20.0, dtype=np.double)
z = g - gz
redshift = np.random.uniform(size=len(i)) * 2
Standardize the column names:
mockdict = {}
for label, item in zip(['u', 'g', 'r', 'i', 'z', 'y', 'redshift'], [u, g, r, i, z, y, redshift]):
    mockdict[f'{label}'] = item
df = pd.DataFrame(mockdict)
df.head()
u | g | r | i | z | y | redshift | |
---|---|---|---|---|---|---|---|
0 | 20.0 | 20.0 | 20.0 | 21.608052 | 17.256106 | 20.0 | 0.601622 |
1 | 20.0 | 20.0 | 20.0 | 18.035623 | 14.795550 | 20.0 | 0.367423 |
2 | 20.0 | 20.0 | 20.0 | 20.508019 | 16.842229 | 20.0 | 0.125606 |
3 | 20.0 | 20.0 | 20.0 | 22.863798 | 14.556936 | 20.0 | 0.169077 |
4 | 20.0 | 20.0 | 20.0 | 25.918845 | 21.892444 | 20.0 | 1.602153 |
Now, let's import the spectroscopic_selections degrader for zCOSMOS.
The ratio file for zCOSMOS, named zCOSMOS_success.txt, is located in the RAIL/src/rail/examples/creation/data/success_rate_data/ directory; the binning in i-band magnitude and redshift is given in zCOSMOS_I_sampling.txt and zCOSMOS_z_sampling.txt.
from rail.creation.degradation.spectroscopic_selections import SpecSelection_zCOSMOS
zcosmos_selecter = SpecSelection_zCOSMOS.make_stage(downsample=False,
colnames={'i':'i','redshift':'redshift'})
Let's run the code and see how long it takes:
%%time
trim_data = zcosmos_selecter(df)
Inserting handle into data store. input: None, specselection_zCOSMOS Inserting handle into data store. output: inprogress_output.pq, specselection_zCOSMOS CPU times: user 2.26 s, sys: 119 ms, total: 2.38 s Wall time: 2.38 s
trim_data.data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 504375 entries, 0 to 1999999 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 u 504375 non-null float64 1 g 504375 non-null float64 2 r 504375 non-null float64 3 i 504375 non-null float64 4 z 504375 non-null float64 5 y 504375 non-null float64 6 redshift 504375 non-null float64 dtypes: float64(7) memory usage: 30.8 MB
And we see that we've kept 504,375 out of the 2,000,000 galaxies in the initial sample, so about 25% of the initial sample. To visualize our cuts, let's read in the success ratios file and plot our sample overlaid with an alpha of 0.05, so that the strength of the black dots gives a visual indication of how many galaxies in each cell we've kept.
# compare to sum of ratios * 100
ratio_file='../../src/rail/examples/creation/data/success_rate_data/zCOSMOS_success.txt'
ratios = np.loadtxt(ratio_file)
ibin_ = np.arange(18, 22.4, 0.01464226, dtype=np.float64)
zbin_ = np.arange(0, 1.4, 0.00587002, dtype=np.float64)
ibin, zbin = np.meshgrid(ibin_, zbin_)
trim_data.data
u | g | r | i | z | y | redshift | |
---|---|---|---|---|---|---|---|
0 | 20.0 | 20.0 | 20.0 | 21.608052 | 17.256106 | 20.0 | 0.601622 |
1 | 20.0 | 20.0 | 20.0 | 18.035623 | 14.795550 | 20.0 | 0.367423 |
2 | 20.0 | 20.0 | 20.0 | 20.508019 | 16.842229 | 20.0 | 0.125606 |
5 | 20.0 | 20.0 | 20.0 | 22.155690 | 14.762925 | 20.0 | 1.201553 |
7 | 20.0 | 20.0 | 20.0 | 18.417438 | 20.662582 | 20.0 | 0.401905 |
... | ... | ... | ... | ... | ... | ... | ... |
1999978 | 20.0 | 20.0 | 20.0 | 19.503473 | 21.386000 | 20.0 | 0.293330 |
1999980 | 20.0 | 20.0 | 20.0 | 21.321786 | 16.921940 | 20.0 | 1.232539 |
1999989 | 20.0 | 20.0 | 20.0 | 20.665312 | 14.531765 | 20.0 | 1.011998 |
1999993 | 20.0 | 20.0 | 20.0 | 22.361941 | 16.112841 | 20.0 | 0.418955 |
1999999 | 20.0 | 20.0 | 20.0 | 18.452306 | 15.162175 | 20.0 | 0.264633 |
504375 rows × 7 columns
plt.figure(figsize=(12,12))
plt.title('zCOSMOS', fontsize=20)
c = plt.pcolormesh(zbin, ibin, ratios.T, cmap='turbo',vmin=0, vmax=1, alpha=0.8)
plt.scatter(trim_data.data['redshift'], trim_data.data['i'], s=2, c='k',alpha =.05)
plt.xlabel("redshift", fontsize=15)
plt.ylabel("i band Magnitude", fontsize=18)
cb = plt.colorbar(c, label='success rate',orientation='horizontal', pad=0.1)
cb.set_label(label='success rate', size=15)
The colormap shows the zCOSMOS success ratios and the strength of the black dots shows how many galaxies were actually kept. We see excellent agreement between our predicted ratios and the actual number of galaxies kept, so the degrader is functioning properly, and we get a nice visual representation of the resulting spectroscopic sample incompleteness.
Using a Creator to Calculate True Posteriors for a Galaxy Sample
Using a Creator to Calculate True Posteriors for a Galaxy Sample¶
author: John Franklin Crenshaw, Sam Schmidt, Eric Charles, others...
last run successfully: March 16, 2022
This notebook demonstrates how to use a RAIL Engine to calculate true posteriors for galaxy samples drawn from the same Engine. Note that this notebook assumes you have already read through degradation-demo.ipynb.
Calculating posteriors is more complicated than drawing samples, because it requires more knowledge of the underlying engine. In this example, we will use the same engine we used in the Degradation demo: FlowEngine, which wraps a normalizing flow from the pzflow package.
This notebook will cover three scenarios of increasing complexity:
1. Calculating posteriors without errors
2. Calculating posteriors while convolving errors
3. Calculating posteriors for galaxies with missing magnitudes
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pzflow.examples import get_example_flow
from rail.creation.engines.flowEngine import FlowCreator, FlowPosterior
from rail.creation.degradation import (
InvRedshiftIncompleteness,
LineConfusion,
LSSTErrorModel,
QuantityCut,
)
from rail.core.data import TableHandle
from rail.core.stage import RailStage
from rail.core.utilStages import ColumnMapper
import pzflow
import os
flow_file = os.path.join(os.path.dirname(pzflow.__file__), 'example_files', 'example-flow.pzflow.pkl')
We'll start by setting up the Rail data store. RAIL uses ceci, which is designed for pipelines rather than interactive notebooks; the data store works around that and enables us to use data interactively. See the rail/examples/goldenspike/goldenspike.ipynb
example notebook for more details on the Data Store.
DS = RailStage.data_store
DS.__class__.allow_overwrite = True
1. Calculating posteriors without errors¶
For a basic first example, let's make a Creator with no degradation and draw a sample.
Note that the FlowEngine.sample method hands back a DataHandle. When talking to RAIL stages we can pass it as though it were the underlying data; this allows the stages to keep track of where their inputs are coming from.
n_samples=6
# create the FlowCreator
flowCreator = FlowCreator.make_stage(name='truth', model=flow_file, n_samples=n_samples)
# draw a few samples
samples_truth = flowCreator.sample(6, seed=0)
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
Inserting handle into data store. model: /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pzflow/example_files/example-flow.pzflow.pkl, truth Inserting handle into data store. output_truth: inprogress_output_truth.pq, truth
Now, let's calculate true posteriors for this sample. Note the important fact here: these are literally the true posteriors for the sample because pzflow gives us direct access to the probability distribution from which the sample was drawn!
When calculating posteriors, the Engine will always require data
, which is a pandas DataFrame of the galaxies for which we are calculating posteriors (in our case the samples_truth
). Because we are using a FlowEngine
, we also must provide grid
, because FlowEngine
calculates posteriors over a grid of redshift values.
Let's calculate posteriors for every galaxy in our sample:
flow_post = FlowPosterior.make_stage(name='truth_post',
column='redshift',
grid = np.linspace(0, 2.5, 100),
marg_rules=dict(flag=np.nan,
u=lambda row: np.linspace(25, 31, 10)),
flow=flow_file)
pdfs = flow_post.get_posterior(samples_truth, column='redshift')
Inserting handle into data store. output_truth_post: inprogress_output_truth_post.hdf5, truth_post Warning. Failed to convert column 'ArrayImpl' object has no attribute 'data'
Note that Creator returns the pdfs as a qp Ensemble:
pdfs.data
<qp.ensemble.Ensemble at 0x7f4a8800cfd0>
Let's plot these pdfs:
fig, axes = plt.subplots(2, 3, constrained_layout=True, dpi=120)
for i, ax in enumerate(axes.flatten()):
    # plot the pdf
    pdfs.data[i].plot_native(axes=ax)
    # plot the true redshift
    ax.axvline(samples_truth.data["redshift"][i], c="k", ls="--")
    # remove x-ticks on top row
    if i < 3:
        ax.set(xticks=[])
    # set x-label on bottom row
    else:
        ax.set(xlabel="redshift")
    # set y-label on far left column
    if i % 3 == 0:
        ax.set(ylabel="p(z)")
The true posteriors are in blue, and the true redshifts are marked by the vertical black lines.
2. Calculating posteriors while convolving errors¶
Now, let's get a little more sophisticated.
Let's recreate the Engine/Degradation we were using at the end of the Degradation demo.
I will make one change however: the LSST Error Model sometimes results in non-detections for faint galaxies. These non-detections are flagged with NaN. Calculating posteriors for galaxies with missing magnitudes is more complicated, so for now, I will add one additional QuantityCut to remove any galaxies with missing magnitudes. To see how to calculate posteriors for galaxies with missing magnitudes, see Section 3.
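As an aside on why a cut at np.inf removes non-detections: NaN fails every numeric comparison, so NaN-flagged magnitudes can never pass the cut. A minimal, self-contained pandas sketch (the column values are invented for illustration):

```python
import numpy as np
import pandas as pd

# NaN fails every comparison, so a cut at np.inf keeps all real
# magnitudes but drops rows flagged NaN (the non-detections).
df = pd.DataFrame({"u": [22.5, np.nan, 25.1]})
detected = df[df["u"] < np.inf]
print(len(detected))  # 2
```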
Now let's draw a degraded sample:
# create the FlowCreator for the degraded sample
n_samples = 20
flowEngine_degr = FlowCreator.make_stage(name='degraded', flow_file=flow_file, n_samples=n_samples)
# draw a few samples
samples_degr = flowEngine_degr.sample(n_samples, seed=0)
errorModel = LSSTErrorModel.make_stage(name='lsst_errors', input='xx')
quantityCut = QuantityCut.make_stage(name='gold_cut', input='xx', cuts={band: np.inf for band in "ugrizy"})
inv_incomplete = InvRedshiftIncompleteness.make_stage(name='incompleteness', pivot_redshift=0.8)
OII = 3727
OIII = 5007
lc_2p_0II_0III = LineConfusion.make_stage(name='lc_2p_0II_0III',
true_wavelen=OII, wrong_wavelen=OIII, frac_wrong=0.02)
lc_1p_0III_0II = LineConfusion.make_stage(name='lc_1p_0III_0II',
true_wavelen=OIII, wrong_wavelen=OII, frac_wrong=0.01)
detection = QuantityCut.make_stage(name='detection', cuts={"i": 25.3})
data = samples_degr
for degr in [errorModel, quantityCut, inv_incomplete, lc_2p_0II_0III, lc_1p_0III_0II, detection]:
    data = degr(data)
Inserting handle into data store. output_degraded: inprogress_output_degraded.pq, degraded Inserting handle into data store. output_lsst_errors: inprogress_output_lsst_errors.pq, lsst_errors Inserting handle into data store. output_gold_cut: inprogress_output_gold_cut.pq, gold_cut Inserting handle into data store. output_incompleteness: inprogress_output_incompleteness.pq, incompleteness Inserting handle into data store. output_lc_2p_0II_0III: inprogress_output_lc_2p_0II_0III.pq, lc_2p_0II_0III Inserting handle into data store. output_lc_1p_0III_0II: inprogress_output_lc_1p_0III_0II.pq, lc_1p_0III_0II Inserting handle into data store. output_detection: inprogress_output_detection.pq, detection
samples_degraded_wo_nondetects = data.data
samples_degraded_wo_nondetects
|   | redshift | u | u_err | g | g_err | r | r_err | i | i_err | z | z_err | y | y_err |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 0.330247 | 22.541234 | 0.010564 | 21.875726 | 0.005297 | 21.150399 | 0.005084 | 20.944225 | 0.005121 | 20.658345 | 0.005214 | 20.648385 | 0.005929 |
| 4 | 0.708621 | 26.203844 | 0.227412 | 25.790756 | 0.053454 | 25.035502 | 0.026957 | 24.285051 | 0.020691 | 24.123581 | 0.031339 | 24.021898 | 0.065234 |
| 7 | 0.804999 | 26.471887 | 0.283218 | 25.054919 | 0.027940 | 23.355602 | 0.007688 | 22.170113 | 0.005855 | 21.518421 | 0.005846 | 21.220781 | 0.007253 |
| 10 | 0.974927 | 26.613024 | 0.317200 | 25.659642 | 0.047591 | 24.771981 | 0.021466 | 24.109077 | 0.017840 | 23.492141 | 0.018192 | 23.204826 | 0.031632 |
| 12 | 0.626096 | 25.081285 | 0.086982 | 24.570988 | 0.018475 | 23.492089 | 0.008276 | 22.638918 | 0.006786 | 22.343938 | 0.007995 | 22.098511 | 0.012546 |
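As a standalone illustration of the error that line confusion introduces (this is just the standard wavelength bookkeeping, not RAIL's internal code): the observed wavelength of a line is fixed, so misattributing a true line at one rest wavelength to a different line changes the inferred redshift:

```python
# A misidentified line leaves the observed wavelength unchanged, so
# wrong_wavelen * (1 + z_wrong) = true_wavelen * (1 + z_true).
OII, OIII = 3727.0, 5007.0  # rest-frame wavelengths in Angstroms

def confused_redshift(z_true, true_wavelen, wrong_wavelen):
    return true_wavelen * (1 + z_true) / wrong_wavelen - 1

# OII emission mistaken for OIII biases the inferred redshift low:
print(confused_redshift(0.5, OII, OIII))  # ~0.117
```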
This sample has photometric errors that we would like to convolve into the redshift posteriors, so that the posteriors are fully consistent with the errors. We can perform this convolution by sampling from the error distributions, calculating posteriors, and averaging.
FlowEngine
has this functionality already built in - we just have to provide err_samples
to the get_posterior
method.
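The idea can be sketched without RAIL at all. Below, a toy stand-in posterior (the Gaussian form and its magnitude dependence are invented for illustration) is convolved by averaging over draws from the photometric error distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0, 2.5, 100)

def toy_posterior(z_grid, mag):
    # hypothetical conditional p(z | mag), normalized on the grid
    p = np.exp(-0.5 * ((z_grid - 0.5 * (mag - 20.0)) / 0.1) ** 2)
    return p / (p.sum() * (z_grid[1] - z_grid[0]))

# sample magnitudes from the error distribution, then average posteriors
mags = rng.normal(24.0, 0.2, size=1000)
pdf_convolved = np.mean([toy_posterior(grid, m) for m in mags], axis=0)
```

Averaging over the error draws broadens the posterior relative to evaluating it at the measured magnitude alone.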
Let's calculate posteriors with a variable number of error samples.
grid = np.linspace(0, 2.5, 100)

def get_degr_post(key, data, **kwargs):
    flow_degr_post = FlowPosterior.make_stage(name=f'degr_post_{key}', **kwargs)
    return flow_degr_post.get_posterior(data, column='redshift')

degr_kwargs = dict(column='redshift', flow_file=flow_file,
                   marg_rules=dict(flag=np.nan, u=lambda row: np.linspace(25, 31, 10)),
                   grid=grid, seed=0, batch_size=2)
pdfs_errs_convolved = {
    err_samples: get_degr_post(f'{str(err_samples)}', data,
                               err_samples=err_samples, **degr_kwargs)
    for err_samples in [1, 10, 100, 1000]
}
Inserting handle into data store. output_degr_post_1: inprogress_output_degr_post_1.hdf5, degr_post_1 Warning. Failed to convert column 'ArrayImpl' object has no attribute 'data' Inserting handle into data store. output_degr_post_10: inprogress_output_degr_post_10.hdf5, degr_post_10 Warning. Failed to convert column 'ArrayImpl' object has no attribute 'data' Inserting handle into data store. output_degr_post_100: inprogress_output_degr_post_100.hdf5, degr_post_100 Warning. Failed to convert column 'ArrayImpl' object has no attribute 'data' Inserting handle into data store. output_degr_post_1000: inprogress_output_degr_post_1000.hdf5, degr_post_1000 Warning. Failed to convert column 'ArrayImpl' object has no attribute 'data'
fig, axes = plt.subplots(2, 3, dpi=120)
for i, ax in enumerate(axes.flatten()):
    # set dummy values for xlim
    xlim = [np.inf, -np.inf]
    for pdfs_ in pdfs_errs_convolved.values():
        # plot the pdf
        pdfs_.data[i].plot_native(axes=ax)
        # get the x value where the pdf first rises above 2
        xmin = grid[np.argmax(pdfs_.data[i].pdf(grid)[0] > 2)]
        if xmin < xlim[0]:
            xlim[0] = xmin
        # get the x value where the pdf finally falls below 2
        # (the -1 picks the last index above threshold, counted from the end)
        xmax = grid[-np.argmax(pdfs_.data[i].pdf(grid)[0, ::-1] > 2) - 1]
        if xmax > xlim[1]:
            xlim[1] = xmax
    # plot the true redshift
    #z_true = samples_degraded_wo_nondetects["redshift"][i]
    #ax.axvline(z_true, c="k", ls="--")
    # set x-label on bottom row
    if i >= 3:
        ax.set(xlabel="redshift")
    # set y-label on far left column
    if i % 3 == 0:
        ax.set(ylabel="p(z)")
    # set the x-limits so we can see more detail
    xlim[0] -= 0.2
    xlim[1] += 0.2
    ax.set(xlim=xlim, yticks=[])
# create the legend
axes[0, 1].plot([], [], c="C0", label="1 sample")
for i, n in enumerate([10, 100, 1000]):
    axes[0, 1].plot([], [], c=f"C{i+1}", label=f"{n} samples")
axes[0, 1].legend(
    bbox_to_anchor=(0.5, 1.3),
    loc="upper center",
    ncol=4,
)
plt.show()
You can see the effect of convolving the errors. In particular, notice that without error convolution (1 sample), the redshift posterior is often unrealistically narrow and inconsistent with the true redshift. As you convolve more samples, the posterior generally broadens and becomes consistent with the true redshift.
Also notice how the posterior continues to change as you convolve more and more samples. This suggests that you need to do a little testing to ensure that you have convolved enough samples.
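This is just the usual 1/√N behavior of Monte Carlo averages; a quick self-contained check (independent of any RAIL API) shows the scatter of an N-sample average shrinking accordingly:

```python
import numpy as np

# The scatter of an N-sample Monte Carlo average falls off like 1/sqrt(N),
# which is why convolved posteriors keep shifting until N is large enough.
rng = np.random.default_rng(0)
scatter = {}
for n in [100, 1000, 10000]:
    means = [rng.normal(0.0, 1.0, size=n).mean() for _ in range(200)]
    scatter[n] = float(np.std(means))
print(scatter)  # roughly {100: 0.1, 1000: 0.03, 10000: 0.01}
```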
Let's plot these same posteriors with even more samples to make sure they have converged:
WARNING: Running the next cell on your computer may exhaust your memory
pdfs_errs_convolved_more_samples = {
err_samples: get_degr_post(f'{str(err_samples)}', data,
err_samples=err_samples, **degr_kwargs)
# for err_samples in [1000, 2000, 5000, 10000]
for err_samples in [12]
}
Inserting handle into data store. output_degr_post_12: inprogress_output_degr_post_12.hdf5, degr_post_12 Warning. Failed to convert column 'ArrayImpl' object has no attribute 'data'
fig, axes = plt.subplots(2, 3, dpi=120)
for i, ax in enumerate(axes.flatten()):
    # set dummy values for xlim
    xlim = [np.inf, -np.inf]
    for pdfs_ in pdfs_errs_convolved_more_samples.values():
        # plot the pdf
        pdfs_.data[i].plot_native(axes=ax)
        # get the x value where the pdf first rises above 2
        xmin = grid[np.argmax(pdfs_.data[i].pdf(grid)[0] > 2)]
        if xmin < xlim[0]:
            xlim[0] = xmin
        # get the x value where the pdf finally falls below 2
        # (the -1 picks the last index above threshold, counted from the end)
        xmax = grid[-np.argmax(pdfs_.data[i].pdf(grid)[0, ::-1] > 2) - 1]
        if xmax > xlim[1]:
            xlim[1] = xmax
    # plot the true redshift
    #z_true = samples_degraded_wo_nondetects["redshift"][i]
    #ax.axvline(z_true, c="k", ls="--")
    # set x-label on bottom row
    if i >= 3:
        ax.set(xlabel="redshift")
    # set y-label on far left column
    if i % 3 == 0:
        ax.set(ylabel="p(z)")
    # set the x-limits so we can see more detail
    xlim[0] -= 0.2
    xlim[1] += 0.2
    ax.set(xlim=xlim, yticks=[])
# create the legend
for i, n in enumerate([1000, 2000, 5000, 10000]):
    axes[0, 1].plot([], [], c=f"C{i}", label=f"{n} samples")
axes[0, 1].legend(
    bbox_to_anchor=(0.5, 1.3),
    loc="upper center",
    ncol=4,
)
plt.show()
Notice that two of these galaxies may take upwards of 10,000 samples to converge (convolving over 10,000 samples takes about 0.5 seconds per galaxy on my computer).
3. Calculating posteriors with missing bands¶
Now let's finally tackle posterior calculation with missing bands.
First, let's make a sample that has missing bands. Let's use the same degrader as we used above, except without the final QuantityCut that removed non-detections:
samples_degraded = DS['output_lc_1p_0III_0II']
You can see that galaxy 3 has a non-detection in the u band. FlowEngine
can handle missing values by marginalizing over that value. By default, FlowEngine
will marginalize over NaNs in the u band, using the grid u = np.linspace(25, 31, 10)
. This default grid should work in most cases, but you may want to change the flag for non-detections, use a different grid for the u band, or marginalize over non-detections in other bands. In order to do these things, you must supply FlowEngine
with marginalization rules in the form of the marg_rules
dictionary.
Let's imagine we want to use a different grid for the u band marginalization. To determine what grid to use, we will create a histogram of non-detections in the u band vs. true u band magnitude (assuming year 10 LSST errors). This will tell us which values of u are reasonable to marginalize over.
# get true u band magnitudes
true_u = DS['output_degraded'].data["u"].to_numpy()
# get the observed u band magnitudes
obs_u = DS['output_lsst_errors'].data["u"].to_numpy()
# create the figure
fig, ax = plt.subplots(constrained_layout=True, dpi=100)
# plot the u band detections
ax.hist(true_u[~np.isnan(obs_u)], bins="fd", label="detected")
# plot the u band non-detections
ax.hist(true_u[np.isnan(obs_u)], bins="fd", label="non-detected")
ax.legend()
ax.set(xlabel="true u magnitude")
plt.show()
Based on this histogram, I will marginalize over u band values from 27 to 31. Just as I tested different numbers of error samples above, here I will test different resolutions for the u band grid.
I will provide our new u band grid in the marg_rules
dictionary, which will also include "flag"
which tells FlowEngine
what my flag for non-detections is.
In this simple example, we are using a fixed grid for the u band, but notice that the u band rule takes the form of a function - this is because the grid over which to marginalize can be a function of any of the other variables in the row.
If I wanted to marginalize over any other bands, I would need to include corresponding rules in marg_rules
too.
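Mechanically, marginalizing over a missing band just averages the conditional posterior over a grid of plausible values for that band. A toy version (the conditional density below is invented; it is not the pzflow flow):

```python
import numpy as np

grid = np.linspace(0, 2.5, 100)
u_grid = np.linspace(27, 31, 10)  # plausible u values for a non-detection

def toy_posterior(z_grid, u):
    # hypothetical conditional p(z | u), normalized on the grid
    p = np.exp(-0.5 * ((z_grid - 0.2 * (u - 26.0)) / 0.15) ** 2)
    return p / (p.sum() * (z_grid[1] - z_grid[0]))

# marginalize: average p(z | u) over the u grid (uniform weights)
pdf_marginalized = np.mean([toy_posterior(grid, u) for u in u_grid], axis=0)
```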
For this example, I will only calculate pdfs for galaxy 3, which is the galaxy with a non-detection in the u band. Also, similarly to how I tested the error convolution with a variable number of samples, I will test the marginalization with varying resolutions for the marginalized grid.
from rail.core.utilStages import RowSelector
# dict to save the marginalized posteriors
pdfs_u_marginalized = {}
row3_selector = RowSelector.make_stage(name='select_row3', start=3, stop=4)
row3_degraded = row3_selector(samples_degraded)
degr_post_kwargs = dict(grid=grid, err_samples=10000, seed=0, flow_file=flow_file, column='redshift')
# iterate over variable grid resolution
for nbins in [10, 20, 50, 100]:
    # set up the marginalization rules for this grid resolution
    marg_rules = {
        "flag": errorModel.config["ndFlag"],
        # bind nbins at definition time to avoid late-binding surprises
        "u": lambda row, nbins=nbins: np.linspace(27, 31, nbins),
    }
    # calculate the posterior by marginalizing over u and sampling
    # from the error distributions of the other galaxies
    pdfs_u_marginalized[nbins] = get_degr_post(f'degr_post_nbins_{nbins}',
                                               row3_degraded,
                                               marg_rules=marg_rules,
                                               **degr_post_kwargs)
Inserting handle into data store. output_select_row3: inprogress_output_select_row3.pq, select_row3 Inserting handle into data store. output_degr_post_degr_post_nbins_10: inprogress_output_degr_post_degr_post_nbins_10.hdf5, degr_post_degr_post_nbins_10 Warning. Failed to convert column 'ArrayImpl' object has no attribute 'data' Inserting handle into data store. output_degr_post_degr_post_nbins_20: inprogress_output_degr_post_degr_post_nbins_20.hdf5, degr_post_degr_post_nbins_20 Warning. Failed to convert column 'ArrayImpl' object has no attribute 'data' Inserting handle into data store. output_degr_post_degr_post_nbins_50: inprogress_output_degr_post_degr_post_nbins_50.hdf5, degr_post_degr_post_nbins_50 Warning. Failed to convert column 'ArrayImpl' object has no attribute 'data' Inserting handle into data store. output_degr_post_degr_post_nbins_100: inprogress_output_degr_post_degr_post_nbins_100.hdf5, degr_post_degr_post_nbins_100 Warning. Failed to convert column 'ArrayImpl' object has no attribute 'data'
fig, ax = plt.subplots(dpi=100)
for i in [10, 20, 50, 100]:
    pdfs_u_marginalized[i]()[0].plot_native(axes=ax, label=f"{i} bins")
ax.axvline(samples_degraded().iloc[3]["redshift"], label="True redshift", c="k")
ax.legend()
ax.set(xlabel="Redshift")
plt.show()
Notice that a resolution of only 10 bins is already sufficient for this marginalization.
In this example, only one of the bands featured a non-detection, but you can easily marginalize over more bands by including corresponding rules in the marg_rules
dict. For example, let's artificially force a non-detection in the y band as well:
row3_degraded()
|   | redshift | u | u_err | g | g_err | r | r_err | i | i_err | z | z_err | y | y_err |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 0.330247 | 22.541234 | 0.010564 | 21.875726 | 0.005297 | 21.150399 | 0.005084 | 20.944225 | 0.005121 | 20.658345 | 0.005214 | 20.648385 | 0.005929 |
sample_double_degraded = row3_degraded().copy()
sample_double_degraded.iloc[0, 11:] *= np.nan
DS.add_data('double_degr', sample_double_degraded, TableHandle)
sample_double_degraded
|   | redshift | u | u_err | g | g_err | r | r_err | i | i_err | z | z_err | y | y_err |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 0.330247 | 22.541234 | 0.010564 | 21.875726 | 0.005297 | 21.150399 | 0.005084 | 20.944225 | 0.005121 | 20.658345 | 0.005214 | NaN | NaN |
# set up the marginalization rules for u and y marginalization
marg_rules = {
    "flag": errorModel.config.ndFlag,
    "u": lambda row: np.linspace(27, 31, 10),
    "y": lambda row: np.linspace(21, 25, 10),
}
# calculate the posterior by marginalizing over u and y, and sampling
# from the error distributions of the other galaxies
#pdf_double_marginalized = get_degr_post('double_degr',
#                                        DS['double_degr'],
#                                        flow_file=flow_file,
#                                        input='xx',
#                                        column='redshift',
#                                        grid=grid,
#                                        err_samples=10000,
#                                        seed=0,
#                                        marg_rules=marg_rules)
#fig, ax = plt.subplots(dpi=100)
#pdf_double_marginalized.data[0].plot_native(axes=ax)
#ax.axvline(sample_double_degraded.iloc[0]["redshift"], label="True redshift", c="k")
#ax.legend()
#ax.set(xlabel="Redshift")
#plt.show()
Note that marginalizing over multiple bands quickly gets expensive.
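A back-of-the-envelope count shows why (the exact cost model is an assumption, but the multiplicative scaling is the point): every error sample has to be evaluated at every point of the Cartesian product of the marginalization grids.

```python
# cost multiplies: err_samples x (u grid) x (y grid) flow evaluations
err_samples = 10000
u_bins, y_bins = 10, 10
evaluations_per_galaxy = err_samples * u_bins * y_bins
print(evaluations_per_galaxy)  # 1000000
```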