Creation Demo Notebooks
Using Engines and Degraders to Generate Galaxy Samples with Errors and Biases¶
author: John Franklin Crenshaw, Sam Schmidt, Eric Charles, others...
last run successfully: March 16, 2022
This notebook demonstrates how to use a RAIL Engine to create galaxy samples, and how to use Degraders to add various errors and biases to the sample.
Note that in the parlance of the Creation Module, "degradation" is any post-processing that occurs to the "true" sample generated by the Engine. This can include adding photometric errors, applying quality cuts, introducing systematic biases, etc.
In this notebook, we will first learn how to draw samples from a RAIL Engine object. Then we will demonstrate how to use the following RAIL Degraders:
- LSSTErrorModel, which adds photometric errors
- QuantityCut, which applies cuts to the specified columns of the sample
- InvRedshiftIncompleteness, which introduces sample incompleteness
- LineConfusion, which introduces spectroscopic errors
Throughout the notebook, we will show how you can chain all these Degraders together to build a more complicated degrader. Hopefully, this will allow you to see how you can build your own degrader.
Note on generating redshift posteriors: regardless of what Degraders you apply, when you use a Creator to estimate posteriors, the posteriors will always be calculated with respect to the "true" distribution. This is the whole point of the Creation Module -- you can generate degraded samples for which we still have access to the true posteriors. For an example of how to calculate posteriors, see posterior-demo.ipynb.
import matplotlib.pyplot as plt
from pzflow.examples import get_example_flow
from rail.creation.engines.flowEngine import FlowCreator
from rail.creation.degradation import (
InvRedshiftIncompleteness,
LineConfusion,
LSSTErrorModel,
QuantityCut,
)
from rail.core.stage import RailStage
Specify the path to the pretrained 'pzflow' used to generate samples¶
import pzflow
import os
flow_file = os.path.join(os.path.dirname(pzflow.__file__), 'example_files', 'example-flow.pzflow.pkl')
We'll start by setting up the Rail data store. RAIL uses ceci, which is designed for pipelines rather than interactive notebooks; the data store works around that and enables us to use data interactively. See the rail/examples/goldenspike/goldenspike.ipynb
example notebook for more details on the Data Store.
DS = RailStage.data_store
DS.__class__.allow_overwrite = True
"True" Engine¶
First, let's make an Engine that has no degradation. We can use it to generate a "true" sample, to which we can compare all the degraded samples below.
Note: in this example, we will use a normalizing flow engine from the pzflow package. However, everything in this notebook is totally agnostic to what the underlying engine is.
The Engine is a type of RailStage object, so we can make one using the RailStage.make_stage function for the class of Engine that we want. We then pass in the configuration parameters as arguments to make_stage.
n_samples = int(1e5)
flowCreator_truth = FlowCreator.make_stage(name='truth', model=flow_file, n_samples=n_samples)
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
Inserting handle into data store. model: /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pzflow/example_files/example-flow.pzflow.pkl, truth
Let's check that the Engine correctly read the underlying pzflow Flow object¶
flowCreator_truth.get_data('model')
<pzflow.flow.Flow at 0x7f1b0495e6a0>
Now we invoke the sample method to generate some samples¶
Note that this will return a DataHandle object, which keeps both the data itself and the path to where the data is written. When talking to RAIL stages we can pass it as though it were the underlying data; this allows the stages to keep track of where their inputs are coming from.
samples_truth = flowCreator_truth.sample(n_samples, seed=0)
print(samples_truth())
print("Data was written to ", samples_truth.path)
Inserting handle into data store. output_truth: inprogress_output_truth.pq, truth redshift u g r i z \ 0 0.890625 27.370831 26.712662 26.025223 25.327188 25.016500 1 1.978239 29.557049 28.361185 27.587231 27.238544 26.628109 2 0.974287 26.566015 25.937716 24.787413 23.872456 23.139563 3 1.317979 29.042730 28.274593 27.501106 26.648790 26.091450 4 1.386366 26.292624 25.774778 25.429958 24.806530 24.367950 ... ... ... ... ... ... ... 99995 2.147172 26.550978 26.349937 26.135286 26.082020 25.911032 99996 1.457508 27.362207 27.036276 26.823139 26.420132 26.110037 99997 1.372992 27.736044 27.271955 26.887581 26.416138 26.043434 99998 0.855022 28.044552 27.327116 26.599014 25.862331 25.592169 99999 1.723768 27.049067 26.526745 26.094595 25.642971 25.197956 y 0 24.926821 1 26.248560 2 22.832047 3 25.346500 4 23.700010 ... ... 99995 25.558136 99996 25.524904 99997 25.456165 99998 25.506388 99999 24.900501 [100000 rows x 7 columns] Data was written to output_truth.pq
Degrader 1: LSSTErrorModel¶
Now, we will demonstrate the LSSTErrorModel, which adds photometric errors using a model similar to that of Ivezic et al. 2019. Specifically, it uses the model from that paper without making the high-SNR assumption; to restore that assumption and use the exact model from the paper, set highSNR=True.
Let's create an error model with the default settings:
errorModel = LSSTErrorModel.make_stage(name='error_model')
To see the details of the model, including the default settings we are using, you can just print the model:
errorModel
LSSTErrorModel parameters: Model for bands: u, g, r, i, z, y Using error type point Exposure time = 30.0 s Number of years of observations = 10.0 Mean visits per year per band: u: 5.6, g: 8.0, r: 18.4, i: 18.4, z: 16.0, y: 16.0 Airmass = 1.2 Irreducible system error = 0.005 Magnitudes dimmer than 30.0 are set to nan gamma for each band: u: 0.038, g: 0.039, r: 0.039, i: 0.039, z: 0.039, y: 0.039 The coadded 5-sigma limiting magnitudes are: u: 26.04, g: 27.29, r: 27.31, i: 26.87, z: 26.23, y: 25.30 The following single-visit 5-sigma limiting magnitudes are calculated using the parameters that follow them: u: 23.83, g: 24.90, r: 24.47, i: 24.03, z: 23.46, y: 22.53 Cm for each band: u: 23.09, g: 24.42, r: 24.44, i: 24.32, z: 24.16, y: 23.73 Median zenith sky brightness in each band: u: 22.99, g: 22.26, r: 21.2, i: 20.48, z: 19.6, y: 18.61 Median zenith seeing FWHM (in arcseconds) for each band: u: 0.81, g: 0.77, r: 0.73, i: 0.71, z: 0.69, y: 0.68 Extinction coefficient for each band: u: 0.491, g: 0.213, r: 0.126, i: 0.096, z: 0.069, y: 0.17
Now let's add this error model as a degrader and draw some samples with photometric errors.
samples_w_errs = errorModel(samples_truth)
samples_w_errs()
Inserting handle into data store. output_error_model: inprogress_output_error_model.pq, error_model
redshift | u | u_err | g | g_err | r | r_err | i | i_err | z | z_err | y | y_err | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.890625 | NaN | NaN | 26.562721 | 0.105583 | 26.084861 | 0.068194 | 25.340978 | 0.052257 | 25.021891 | 0.069445 | 25.047443 | 0.159796 |
1 | 1.978239 | NaN | NaN | 28.038419 | 0.362520 | 27.490722 | 0.229680 | 28.102581 | 0.525461 | 26.066428 | 0.172483 | 25.834953 | 0.307316 |
2 | 0.974287 | 26.873697 | 0.389236 | 25.882633 | 0.057988 | 24.797719 | 0.021944 | 23.873355 | 0.014716 | 23.128763 | 0.013557 | 22.861474 | 0.023448 |
3 | 1.317979 | 27.914048 | 0.817339 | 27.705399 | 0.277971 | 27.204204 | 0.180633 | 26.703293 | 0.172092 | 25.931166 | 0.153677 | 25.795159 | 0.297649 |
4 | 1.386366 | 26.336934 | 0.253759 | 25.750773 | 0.051593 | 25.483414 | 0.039993 | 24.809233 | 0.032626 | 24.301733 | 0.036670 | 23.576059 | 0.043921 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
99995 | 2.147172 | 26.643909 | 0.325091 | 26.212954 | 0.077661 | 26.220695 | 0.076900 | 26.027656 | 0.095907 | 26.102146 | 0.177794 | 25.635739 | 0.261534 |
99996 | 1.457508 | 26.621966 | 0.319467 | 26.982388 | 0.151845 | 26.542811 | 0.102093 | 26.446734 | 0.138137 | 25.959232 | 0.157414 | 25.461991 | 0.226646 |
99997 | 1.372992 | 26.679523 | 0.334399 | 27.416936 | 0.219265 | 27.042587 | 0.157411 | 26.480484 | 0.142215 | 26.165722 | 0.187622 | 24.902178 | 0.141068 |
99998 | 0.855022 | 26.886674 | 0.393155 | 27.355825 | 0.208363 | 26.494891 | 0.097896 | 25.783669 | 0.077364 | 25.514723 | 0.107157 | 25.333237 | 0.203557 |
99999 | 1.723768 | 27.557109 | 0.643300 | 26.442709 | 0.095055 | 26.216528 | 0.076618 | 25.710465 | 0.072517 | 25.169914 | 0.079153 | 24.799610 | 0.129108 |
100000 rows × 13 columns
Notice that some of the magnitudes are NaN. These are non-detections: those observed magnitudes were beyond the default 30 mag limit in LSSTErrorModel.
You can change this limit and the corresponding flag by setting magLim=... and ndFlag=... in the constructor for LSSTErrorModel.
Let's plot the error as a function of magnitude:
fig, ax = plt.subplots(figsize=(5, 4), dpi=100)
for band in "ugrizy":
# pull out the magnitudes and errors
mags = samples_w_errs.data[band].to_numpy()
errs = samples_w_errs.data[band + "_err"].to_numpy()
# sort them by magnitude
mags, errs = mags[mags.argsort()], errs[mags.argsort()]
# plot errs vs mags
ax.plot(mags, errs, label=band)
ax.legend()
ax.set(xlabel="Magnitude (AB)", ylabel="Error (mags)")
plt.show()
You can see that the photometric error increases as magnitude gets dimmer, just like you would expect. Notice, however, that we have galaxies as dim as magnitude 30. This is because the Flow produces a sample much deeper than the LSST 5-sigma limiting magnitudes. There are no galaxies dimmer than magnitude 30 because LSSTErrorModel sets magnitudes > 30 equal to NaN (the default flag for non-detections).
Degrader 2: QuantityCut¶
Recall how the sample above has galaxies as dim as magnitude 30. This is well beyond the LSST 5-sigma limiting magnitudes, so it will be useful to apply cuts to the data to filter out these super-dim samples. We can apply these cuts using the QuantityCut degrader, which will cut out any samples that do not pass all of the specified cuts.
Let's make a degrader that cuts at i < 25.3, which matches the LSST gold sample, and apply it to the sample that already has photometric errors.
gold_cut = QuantityCut.make_stage(name='cuts', cuts={"i": 25.3})
Now we can run this degrader on the sample with photometric errors:
samples_gold_w_errs = gold_cut(samples_w_errs)
Inserting handle into data store. output_cuts: inprogress_output_cuts.pq, cuts
If you look at the i column, you will see there are no longer any samples with i > 25.3. Unlike the LSSTErrorModel degrader, this degrader removes galaxies: the number returned has been nearly cut in half relative to the input. In general, for any degrader that removes galaxies, the size of the output sample will not equal that of the input sample.
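Conceptually, on a pandas DataFrame the gold cut is just a row filter. A sketch of the equivalent operation (plain pandas on a made-up mini-catalog, not the actual QuantityCut code):

```python
import pandas as pd

# Hypothetical mini-catalog standing in for samples_w_errs
df = pd.DataFrame({"i": [24.8, 25.2, 25.9, 26.4],
                   "redshift": [0.5, 1.1, 1.6, 2.0]})

# QuantityCut({"i": 25.3}) keeps only rows passing ALL specified cuts;
# here that is just i < 25.3. NaN magnitudes (non-detections) fail the
# comparison and are dropped as well.
gold = df[df["i"] < 25.3]
print(len(gold))  # 2
```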
One more note: it is easy to use the QuantityCut degrader as a SNR cut on the magnitudes. The magnitude equation is $m = -2.5 \log(f)$. Taking the derivative, we have
$$
dm = \frac{2.5}{\ln(10)} \frac{df}{f} = \frac{2.5}{\ln(10)} \frac{1}{\mathrm{SNR}}.
$$
So if you want to make a cut on galaxies above a certain SNR, you can make a cut
$$
dm < \frac{2.5}{\ln(10)} \frac{1}{\mathrm{SNR}}.
$$
For example, an SNR cut on the i band would look like this: QuantityCut({"i_err": 2.5/np.log(10) * 1/SNR}).
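To make the relation concrete, here is the numerical conversion for a few common thresholds. The helper name `snr_to_mag_err` is ours, just for illustration; the constant is 2.5/ln 10 ≈ 1.086.

```python
import numpy as np

def snr_to_mag_err(snr):
    """Magnitude-error threshold corresponding to a given SNR cut."""
    return 2.5 / np.log(10) / snr

for snr in (5, 10, 20):
    print(f"SNR > {snr:2d}  <=>  mag err < {snr_to_mag_err(snr):.4f}")
# e.g. an SNR > 10 cut corresponds to a magnitude-error cut of about 0.109
```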
Degrader 3: InvRedshiftIncompleteness¶
Next, we will demonstrate the InvRedshiftIncompleteness degrader. It applies a selection function which keeps galaxies with probability $p_{\text{keep}}(z) = \min(1, \frac{z_p}{z})$, where $z_p$ is the "pivot" redshift. We'll use $z_p = 0.8$.
inv_incomplete = InvRedshiftIncompleteness.make_stage(name='incompleteness', pivot_redshift=0.8)
samples_incomplete_gold_w_errs = inv_incomplete(samples_gold_w_errs)
Inserting handle into data store. output_incompleteness: inprogress_output_incompleteness.pq, incompleteness
Let's plot the redshift distributions of the samples we have generated so far:
fig, ax = plt.subplots(figsize=(5, 4), dpi=100)
zmin = 0
zmax = 2.5
hist_settings = {
"bins": 50,
"range": (zmin, zmax),
"density": True,
"histtype": "step",
}
ax.hist(samples_truth()["redshift"], label="Truth", **hist_settings)
ax.hist(samples_gold_w_errs()["redshift"], label="Gold", **hist_settings)
ax.hist(samples_incomplete_gold_w_errs()["redshift"], label="Incomplete Gold", **hist_settings)
ax.legend(title="Sample")
ax.set(xlim=(zmin, zmax), xlabel="Redshift", ylabel="Galaxy density")
plt.show()
You can see that the Gold sample has significantly fewer high-redshift galaxies than the truth. This is because many of the high-redshift galaxies have i > 25.3.
You can further see that the Incomplete Gold sample has even fewer high-redshift galaxies. This is exactly what we expected from this degrader.
Degrader 4: LineConfusion¶
LineConfusion is a degrader that simulates spectroscopic errors resulting from the confusion of different emission lines.
For this example, let's use the degrader to simulate a scenario in which 2% of [OII] lines are mistaken as [OIII] lines, and 1% of [OIII] lines are mistaken as [OII] lines. (Note that I do not know how realistic this scenario is!)
OII = 3727
OIII = 5007
lc_2p_0II_0III = LineConfusion.make_stage(name='lc_2p_0II_0III',
true_wavelen=OII, wrong_wavelen=OIII, frac_wrong=0.02)
lc_1p_0III_0II = LineConfusion.make_stage(name='lc_1p_0III_0II',
true_wavelen=OIII, wrong_wavelen=OII, frac_wrong=0.01)
samples_conf_inc_gold_w_errs = lc_1p_0III_0II(lc_2p_0II_0III(samples_incomplete_gold_w_errs))
Inserting handle into data store. output_lc_2p_0II_0III: inprogress_output_lc_2p_0II_0III.pq, lc_2p_0II_0III Inserting handle into data store. output_lc_1p_0III_0II: inprogress_output_lc_1p_0III_0II.pq, lc_1p_0III_0II
Let's plot the redshift distributions one more time
fig, ax = plt.subplots(figsize=(5, 4), dpi=100)
zmin = 0
zmax = 2.5
hist_settings = {
"bins": 50,
"range": (zmin, zmax),
"density": True,
"histtype": "step",
}
ax.hist(samples_truth()["redshift"], label="Truth", **hist_settings)
ax.hist(samples_gold_w_errs()["redshift"], label="Gold", **hist_settings)
ax.hist(samples_incomplete_gold_w_errs()["redshift"], label="Incomplete Gold", **hist_settings)
ax.hist(samples_conf_inc_gold_w_errs()["redshift"], label="Confused Incomplete Gold", **hist_settings)
ax.legend(title="Sample")
ax.set(xlim=(zmin, zmax), xlabel="Redshift", ylabel="Galaxy density")
plt.show()
You can see that the redshift distribution of this new sample is essentially identical to the Incomplete Gold sample, with small perturbations that result from the line confusion.
However the real impact of this degrader isn't on the redshift distribution, but rather that it introduces erroneous spec-z's into the photo-z training sets! To see the impact of this effect, let's plot the true spec-z's as present in the Incomplete Gold sample, vs the spec-z's listed in the new sample with Oxygen Line Confusion.
fig, ax = plt.subplots(figsize=(6, 6), dpi=85)
ax.scatter(samples_incomplete_gold_w_errs()["redshift"], samples_conf_inc_gold_w_errs()["redshift"],
marker=".", s=1)
ax.set(
xlim=(0, 2.5), ylim=(0, 2.5),
xlabel="True spec-z (in Incomplete Gold sample)",
ylabel="Spec-z listed in the Confused sample",
)
plt.show()
Now we can clearly see the spec-z errors! The galaxies above the line y=x are the [OII] -> [OIII] galaxies, while the ones below are the [OIII] -> [OII] galaxies.
example_GridSelection_for_HSC.ipynb
GridSelection degrader to emulate HSC training samples¶
The GridSelection degrader can be used to model the spectroscopic success rates in training sets based on real data. Given a 2-dimensional grid of spec-z success ratio as a function of two variables (often magnitude or color), the degrader will draw the appropriate fraction of samples from the input data and return a sample with incompleteness modeled. An additional redshift cut can also be applied, where all redshifts above the cutoff are also removed from the sample.
The degrader takes the following arguments:
- ratio_file: the name of the file containing the 2-dimensional grid of spec-z success
- random_seed: random seed to feed to numpy for reproducibility
- settings_file: path to the pickled file containing settings that define the 2-dimensional grid. There is a mechanism to make cuts either on a single column from the input data, or on a difference (i.e. either a magnitude or a color as a difference of two magnitudes). The parameters in the settings file are:
  - x_band_1: column name for the x-axis variable from the ratios grid
  - x_band_2: [optional] column name for the second x-axis variable. If x_band_2 is set to '' then the x-axis is parameterized in terms of x_band_1 alone; if x_band_2 is not '' then the x-axis is compared against (x_band_1 - x_band_2)
  - y_band_1 and y_band_2: analogous to x_band_1 and x_band_2, but for the y-axis
  - x_limits and y_limits: 2-element lists with the minimum and maximum values for the grid, e.g. [13, 26] if the limits in the x-axis are between magnitudes of 13 and 26
In this quick notebook we'll create a grid of mock galaxies on the same grid on which the Hyper Suprime-Cam (HSC) survey spectroscopic success has been parameterized by Irene Moskowitz (available in RAIL/examples/creation/data), and plot the success rate to visualize the spectroscopic success rate for HSC.
import rail
import os
import matplotlib.pyplot as plt
import numpy as np
import tables_io
import pandas as pd
#from rail.core.data import TableHandle
from rail.core.stage import RailStage
%matplotlib inline
DS = RailStage.data_store
DS.__class__.allow_overwrite = True
Let's make a grid of fake data matching the grid used by HSC. The 2D grid of spec-z success in this case is parameterized in terms of g-z color vs i-band magnitude, with g-z between [-2, 6] and i-band magnitude spanning [13, 26]. Let's generate 100 identical objects in each of the 100x100=10,000 grid cells, for a total of 1,000,000 mock galaxies. The only quantities that we need are g, i, z magnitudes and a redshift that we can just set to a random number between 0 and 2.5. The only thing that we really need is consistent g-z and i-mag values, so we can just set g to 20.0 in all circumstances.
gridgz, gridi = np.meshgrid(np.linspace(-1.98, 5.98, 100), np.linspace(13.0325, 25.9675, 100))
i = gridi.flatten()
gz = gridgz.flatten()
g = np.full_like(i, 20.0, dtype=np.double)
z = g - gz
redshift = np.round(np.random.uniform(size=len(i))*2.5, 2)
mockdict = {}
for label, item in zip(['i', 'gz', 'g', 'z', 'redshift'], [i, gz, g, z, redshift]):
mockdict[f'{label}'] = np.repeat(item, 100).flatten()
df = pd.DataFrame(mockdict)
df.head()
i | gz | g | z | redshift | |
---|---|---|---|---|---|
0 | 13.0325 | -1.98 | 20.0 | 21.98 | 1.13 |
1 | 13.0325 | -1.98 | 20.0 | 21.98 | 1.13 |
2 | 13.0325 | -1.98 | 20.0 | 21.98 | 1.13 |
3 | 13.0325 | -1.98 | 20.0 | 21.98 | 1.13 |
4 | 13.0325 | -1.98 | 20.0 | 21.98 | 1.13 |
Now, let's import the GridSelection degrader and set up the config dict parameters. We will set a redshift cut of 5.1 so as not to cut any of our mock galaxies; if you wanted a redshift cut, you would simply change this parameter as desired. There is an optional color_redshift_cut that scales the number of galaxies kept; we will turn this off. There is also a percentile_cut that computes redshift percentiles in each cell and removes the highest-redshift galaxies, as those are usually the least likely to be recovered by a spectroscopic survey. For simplicity, we will set percentile_cut to 100 so that this cut removes no galaxies.
The ratio file for HSC is located in the RAIL/examples/creation/data directory; as we are running this demo from the RAIL/examples/creation folder, the paths for ratio_file and settings_file are set accordingly.
We will set a random seed for reproducibility, and set the output file to write our incomplete catalog to "test_hsc.pq".
from rail.creation.degradation.grid_selection import GridSelection
configdict = dict(redshift_cut=5.1, ratio_file='../../src/rail/examples/creation/data/hsc_ratios_and_specz.hdf5',
settings_file='../../src/rail/examples/creation/data/HSC_grid_settings.pkl', percentile_cut=100.,
color_redshift_cut=False,
output="test_hsc.pq", random_seed=66)
hsc_selecter = GridSelection.make_stage(name='hsc_cutter', **configdict)
Let's run the code and see how long it takes:
%%time
trim_data = hsc_selecter(df)
Inserting handle into data store. input: None, hsc_cutter Inserting handle into data store. output_hsc_cutter: inprogress_test_hsc.pq, hsc_cutter CPU times: user 4.51 s, sys: 357 ms, total: 4.86 s Wall time: 4.87 s
This took about 5 seconds in the run above, not too bad for 1,000,000 mock galaxies.
trim_data().info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 181751 entries, 343119 to 924109 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 i 181751 non-null float64 1 gz 181751 non-null float64 2 g 181751 non-null float64 3 z 181751 non-null float64 4 redshift 181751 non-null float64 5 max_specz 181751 non-null float64 dtypes: float64(6) memory usage: 9.7 MB
And we see that we've kept 181,751 out of the 1,000,000 galaxies in the initial sample, so about 18% of the initial sample. To visualize our cuts, let's read in the success ratios file and plot our sample overlaid with a low alpha, so that the strength of the black dots gives a visual indication of how many galaxies in each cell we've kept.
# compare to sum of ratios * 100
ratio_file='../../src/rail/examples/creation/data/hsc_ratios_and_specz.hdf5'
allrats = tables_io.read(ratio_file)['ratios']
trim_data()['color'] = trim_data()['g'] - trim_data()['z']
bgr, bi = np.meshgrid(np.linspace(-2, 6, 101), np.linspace(13, 26, 101))
bratios = tables_io.read("../../src/rail/examples/creation/data/hsc_ratios_and_specz.hdf5")['ratios']
plt.figure(figsize=(18,18))
plt.pcolormesh(bi, bgr, bratios.T, cmap='jet', vmin=0, vmax=1,alpha=0.4)
plt.scatter(trim_data()['i'], trim_data()['color'], s=3, c='k',alpha =.015)
plt.xlabel("i-band mag", fontsize=18)
plt.ylabel("g-z", fontsize=18)
Text(0, 0.5, 'g-z')
The colormap shows the HSC ratios and the strength of the black dots shows how many galaxies were actually kept. We see excellent agreement between our predicted ratios and the actual number of galaxies kept, so the degrader is functioning properly, and we get a nice visual representation of the resulting spectroscopic sample incompleteness.
# as our demo data is just a grid of mock points, let's remove it so we do not leave a large file sitting around
os.remove("./test_hsc.pq")
example_SpecSelection_for_zCOSMOS.ipynb
Spectroscopic selection degrader to emulate zCOSMOS training samples¶
The spectroscopic_selection degrader can be used to model the spectroscopic success rates in training sets based on real data. Given a 2-dimensional grid of spec-z success ratio as a function of two variables (often magnitude, color, or redshift), the degrader will draw the appropriate fraction of samples from the input data and return a sample with incompleteness modeled.
The degrader takes the following arguments:
- N_tot: number of selected sources
- nondetect_val: non-detected magnitude value to be excluded (usually 99.0, -99.0 or NaN)
- downsample: if true, downsample the selected sources to a total number of N_tot
- success_rate_dir: the path to the directory containing success rate files
- colnames: a dictionary of the columns (magnitudes, colors and redshift) needed for selection. For magnitudes, the keys are ugrizy; for colors, the keys are, for example, gr standing for g-r; for redshift, the key is 'redshift'. In this demo, zCOSMOS takes {'i':'i', 'redshift':'redshift'} as the minimum necessary input
In this quick notebook we'll select galaxies based on the zCOSMOS selection function.
import rail
import os
import matplotlib.pyplot as plt
import numpy as np
import tables_io
import pandas as pd
#from rail.core.data import TableHandle
from rail.core.stage import RailStage
%matplotlib inline
DS = RailStage.data_store
DS.__class__.allow_overwrite = True
Let's make fake data for zCOSMOS selection.
i = np.random.uniform(low=18, high=25.9675, size=(2000000,))
gz = np.random.uniform(low=-1.98, high=5.98, size=(2000000,))
u = np.full_like(i, 20.0, dtype=np.double)
g = np.full_like(i, 20.0, dtype=np.double)
r = np.full_like(i, 20.0, dtype=np.double)
y = np.full_like(i, 20.0, dtype=np.double)
z = g - gz
redshift = np.random.uniform(size=len(i)) * 2
Standardize the column names:
mockdict = {}
for label, item in zip(['u', 'g', 'r', 'i', 'z', 'y', 'redshift'], [u, g, r, i, z, y, redshift]):
    mockdict[f'{label}'] = item
df = pd.DataFrame(mockdict)
df.head()
u | g | r | i | z | y | redshift | |
---|---|---|---|---|---|---|---|
0 | 20.0 | 20.0 | 20.0 | 21.608052 | 17.256106 | 20.0 | 0.601622 |
1 | 20.0 | 20.0 | 20.0 | 18.035623 | 14.795550 | 20.0 | 0.367423 |
2 | 20.0 | 20.0 | 20.0 | 20.508019 | 16.842229 | 20.0 | 0.125606 |
3 | 20.0 | 20.0 | 20.0 | 22.863798 | 14.556936 | 20.0 | 0.169077 |
4 | 20.0 | 20.0 | 20.0 | 25.918845 | 21.892444 | 20.0 | 1.602153 |
Now, let's import the spectroscopic_selections degrader for zCOSMOS.
The ratio file for zCOSMOS, named zCOSMOS_success.txt, is located in the RAIL/src/rail/examples/creation/data/success_rate_data/ directory; the binning in i-band magnitude and redshift is given in zCOSMOS_I_sampling.txt and zCOSMOS_z_sampling.txt.
from rail.creation.degradation.spectroscopic_selections import SpecSelection_zCOSMOS
zcosmos_selecter = SpecSelection_zCOSMOS.make_stage(downsample=False,
colnames={'i':'i','redshift':'redshift'})
Let's run the code and see how long it takes:
%%time
trim_data = zcosmos_selecter(df)
Inserting handle into data store. input: None, specselection_zCOSMOS Inserting handle into data store. output: inprogress_output.pq, specselection_zCOSMOS CPU times: user 2.26 s, sys: 119 ms, total: 2.38 s Wall time: 2.38 s
trim_data.data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 504375 entries, 0 to 1999999 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 u 504375 non-null float64 1 g 504375 non-null float64 2 r 504375 non-null float64 3 i 504375 non-null float64 4 z 504375 non-null float64 5 y 504375 non-null float64 6 redshift 504375 non-null float64 dtypes: float64(7) memory usage: 30.8 MB
And we see that we've kept 504,375 out of the 2,000,000 galaxies in the initial sample, so about 25% of the initial sample. To visualize our cuts, let's read in the success ratios file and plot our sample overlaid with an alpha of 0.05, so that the strength of the black dots gives a visual indication of how many galaxies in each cell we've kept.
# compare to sum of ratios * 100
ratio_file='../../src/rail/examples/creation/data/success_rate_data/zCOSMOS_success.txt'
ratios = np.loadtxt(ratio_file)
ibin_ = np.arange(18, 22.4, 0.01464226, dtype=np.float64)
zbin_ = np.arange(0, 1.4, 0.00587002, dtype=np.float64)
ibin, zbin = np.meshgrid(ibin_, zbin_)
trim_data.data
u | g | r | i | z | y | redshift | |
---|---|---|---|---|---|---|---|
0 | 20.0 | 20.0 | 20.0 | 21.608052 | 17.256106 | 20.0 | 0.601622 |
1 | 20.0 | 20.0 | 20.0 | 18.035623 | 14.795550 | 20.0 | 0.367423 |
2 | 20.0 | 20.0 | 20.0 | 20.508019 | 16.842229 | 20.0 | 0.125606 |
5 | 20.0 | 20.0 | 20.0 | 22.155690 | 14.762925 | 20.0 | 1.201553 |
7 | 20.0 | 20.0 | 20.0 | 18.417438 | 20.662582 | 20.0 | 0.401905 |
... | ... | ... | ... | ... | ... | ... | ... |
1999978 | 20.0 | 20.0 | 20.0 | 19.503473 | 21.386000 | 20.0 | 0.293330 |
1999980 | 20.0 | 20.0 | 20.0 | 21.321786 | 16.921940 | 20.0 | 1.232539 |
1999989 | 20.0 | 20.0 | 20.0 | 20.665312 | 14.531765 | 20.0 | 1.011998 |
1999993 | 20.0 | 20.0 | 20.0 | 22.361941 | 16.112841 | 20.0 | 0.418955 |
1999999 | 20.0 | 20.0 | 20.0 | 18.452306 | 15.162175 | 20.0 | 0.264633 |
504375 rows × 7 columns
plt.figure(figsize=(12,12))
plt.title('zCOSMOS', fontsize=20)
c = plt.pcolormesh(zbin, ibin, ratios.T, cmap='turbo',vmin=0, vmax=1, alpha=0.8)
plt.scatter(trim_data.data['redshift'], trim_data.data['i'], s=2, c='k',alpha =.05)
plt.xlabel("redshift", fontsize=15)
plt.ylabel("i band Magnitude", fontsize=18)
cb = plt.colorbar(c, label='success rate',orientation='horizontal', pad=0.1)
cb.set_label(label='success rate', size=15)
The colormap shows the zCOSMOS success ratios and the strength of the black dots shows how many galaxies were actually kept. We see excellent agreement between our predicted ratios and the actual number of galaxies kept, so the degrader is functioning properly, and we get a nice visual representation of the resulting spectroscopic sample incompleteness.
Using a Creator to Calculate True Posteriors for a Galaxy Sample
Using a Creator to Calculate True Posteriors for a Galaxy Sample¶
author: John Franklin Crenshaw, Sam Schmidt, Eric Charles, others...
last run successfully: March 16, 2022
This notebook demonstrates how to use a RAIL Engine to calculate true posteriors for galaxy samples drawn from the same Engine. Note that this notebook assumes you have already read through degradation-demo.ipynb.
Calculating posteriors is more complicated than drawing samples, because it requires more knowledge of the underlying engine. In this example, we will use the same engine we used in the Degradation demo: FlowEngine, which wraps a normalizing flow from the pzflow package.
This notebook will cover three scenarios of increasing complexity:
1. Calculating posteriors without errors
2. Calculating posteriors while convolving errors
3. Calculating posteriors for galaxies with missing magnitudes
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pzflow.examples import get_example_flow
from rail.creation.engines.flowEngine import FlowCreator, FlowPosterior
from rail.creation.degradation import (
InvRedshiftIncompleteness,
LineConfusion,
LSSTErrorModel,
QuantityCut,
)
from rail.core.data import TableHandle
from rail.core.stage import RailStage
from rail.core.utilStages import ColumnMapper
import pzflow
import os
flow_file = os.path.join(os.path.dirname(pzflow.__file__), 'example_files', 'example-flow.pzflow.pkl')
We'll start by setting up the Rail data store. RAIL uses ceci, which is designed for pipelines rather than interactive notebooks; the data store works around that and enables us to use data interactively. See the rail/examples/goldenspike/goldenspike.ipynb
example notebook for more details on the Data Store.
DS = RailStage.data_store
DS.__class__.allow_overwrite = True
1. Calculating posteriors without errors¶
For a basic first example, let's make a Creator with no degradation and draw a sample.
Note that the FlowEngine.sample method hands back a DataHandle. When talking to RAIL stages we can pass it as though it were the underlying data; this allows the stages to keep track of where their inputs are coming from.
n_samples=6
# create the FlowCreator
flowCreator = FlowCreator.make_stage(name='truth', model=flow_file, n_samples=n_samples)
# draw a few samples
samples_truth = flowCreator.sample(6, seed=0)
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
Inserting handle into data store. model: /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pzflow/example_files/example-flow.pzflow.pkl, truth Inserting handle into data store. output_truth: inprogress_output_truth.pq, truth
Now, let's calculate true posteriors for this sample. Note the important fact here: these are literally the true posteriors for the sample because pzflow gives us direct access to the probability distribution from which the sample was drawn!
When calculating posteriors, the Engine will always require data
, which is a pandas DataFrame of the galaxies for which we are calculating posteriors (in our case the samples_truth
). Because we are using a FlowEngine
, we also must provide grid
, because FlowEngine
calculates posteriors over a grid of redshift values.
Let's calculate posteriors for every galaxy in our sample:
flow_post = FlowPosterior.make_stage(name='truth_post',
column='redshift',
grid = np.linspace(0, 2.5, 100),
marg_rules=dict(flag=np.nan,
u=lambda row: np.linspace(25, 31, 10)),
flow=flow_file)
pdfs = flow_post.get_posterior(samples_truth, column='redshift')
Inserting handle into data store. output_truth_post: inprogress_output_truth_post.hdf5, truth_post Warning. Failed to convert column 'ArrayImpl' object has no attribute 'data'
Note that Creator returns the pdfs as a qp Ensemble:
pdfs.data
<qp.ensemble.Ensemble at 0x7f4a8800cfd0>
Let's plot these pdfs:
fig, axes = plt.subplots(2, 3, constrained_layout=True, dpi=120)
for i, ax in enumerate(axes.flatten()):
    # plot the pdf
    pdfs.data[i].plot_native(axes=ax)
    # plot the true redshift
    ax.axvline(samples_truth.data["redshift"][i], c="k", ls="--")
    # remove x-ticks on top row
    if i < 3:
        ax.set(xticks=[])
    # set x-label on bottom row
    else:
        ax.set(xlabel="redshift")
    # set y-label on far left column
    if i % 3 == 0:
        ax.set(ylabel="p(z)")
The true posteriors are in blue, and the true redshifts are marked by the vertical black lines.
2. Calculating posteriors while convolving errors¶
Now, let's get a little more sophisticated.
Let's recreate the Engine/Degradation we were using at the end of the Degradation demo.
I will make one change however: the LSST Error Model sometimes results in non-detections for faint galaxies. These non-detections are flagged with NaN. Calculating posteriors for galaxies with missing magnitudes is more complicated, so for now, I will add one additional QuantityCut to remove any galaxies with missing magnitudes. To see how to calculate posteriors for galaxies with missing magnitudes, see Section 3.
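As an aside on why a cut at np.inf removes non-detections: NaN fails every numeric comparison, so NaN-flagged magnitudes can never pass the cut. A minimal, self-contained pandas sketch (the column values are invented for illustration):

```python
import numpy as np
import pandas as pd

# NaN fails every comparison, so a cut at np.inf keeps all real
# magnitudes but drops rows flagged NaN (the non-detections).
df = pd.DataFrame({"u": [22.5, np.nan, 25.1]})
detected = df[df["u"] < np.inf]
print(len(detected))  # 2
```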
Now let's draw a degraded sample:
# create the FlowCreator for the degraded sample
n_samples = 20
flowEngine_degr = FlowCreator.make_stage(name='degraded', flow_file=flow_file, n_samples=n_samples)
# draw a few samples
samples_degr = flowEngine_degr.sample(n_samples, seed=0)
errorModel = LSSTErrorModel.make_stage(name='lsst_errors', input='xx')
quantityCut = QuantityCut.make_stage(name='gold_cut', input='xx', cuts={band: np.inf for band in "ugrizy"})
inv_incomplete = InvRedshiftIncompleteness.make_stage(name='incompleteness', pivot_redshift=0.8)
OII = 3727
OIII = 5007
lc_2p_0II_0III = LineConfusion.make_stage(name='lc_2p_0II_0III',
true_wavelen=OII, wrong_wavelen=OIII, frac_wrong=0.02)
lc_1p_0III_0II = LineConfusion.make_stage(name='lc_1p_0III_0II',
true_wavelen=OIII, wrong_wavelen=OII, frac_wrong=0.01)
detection = QuantityCut.make_stage(name='detection', cuts={"i": 25.3})
data = samples_degr
for degr in [errorModel, quantityCut, inv_incomplete, lc_2p_0II_0III, lc_1p_0III_0II, detection]:
    data = degr(data)
Inserting handle into data store. output_degraded: inprogress_output_degraded.pq, degraded Inserting handle into data store. output_lsst_errors: inprogress_output_lsst_errors.pq, lsst_errors Inserting handle into data store. output_gold_cut: inprogress_output_gold_cut.pq, gold_cut Inserting handle into data store. output_incompleteness: inprogress_output_incompleteness.pq, incompleteness Inserting handle into data store. output_lc_2p_0II_0III: inprogress_output_lc_2p_0II_0III.pq, lc_2p_0II_0III Inserting handle into data store. output_lc_1p_0III_0II: inprogress_output_lc_1p_0III_0II.pq, lc_1p_0III_0II Inserting handle into data store. output_detection: inprogress_output_detection.pq, detection
samples_degraded_wo_nondetects = data.data
samples_degraded_wo_nondetects
|   | redshift | u | u_err | g | g_err | r | r_err | i | i_err | z | z_err | y | y_err |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 0.330247 | 22.541234 | 0.010564 | 21.875726 | 0.005297 | 21.150399 | 0.005084 | 20.944225 | 0.005121 | 20.658345 | 0.005214 | 20.648385 | 0.005929 |
| 4 | 0.708621 | 26.203844 | 0.227412 | 25.790756 | 0.053454 | 25.035502 | 0.026957 | 24.285051 | 0.020691 | 24.123581 | 0.031339 | 24.021898 | 0.065234 |
| 7 | 0.804999 | 26.471887 | 0.283218 | 25.054919 | 0.027940 | 23.355602 | 0.007688 | 22.170113 | 0.005855 | 21.518421 | 0.005846 | 21.220781 | 0.007253 |
| 10 | 0.974927 | 26.613024 | 0.317200 | 25.659642 | 0.047591 | 24.771981 | 0.021466 | 24.109077 | 0.017840 | 23.492141 | 0.018192 | 23.204826 | 0.031632 |
| 12 | 0.626096 | 25.081285 | 0.086982 | 24.570988 | 0.018475 | 23.492089 | 0.008276 | 22.638918 | 0.006786 | 22.343938 | 0.007995 | 22.098511 | 0.012546 |
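As a standalone illustration of the error that line confusion introduces (this is just the standard wavelength bookkeeping, not RAIL's internal code): the observed wavelength of a line is fixed, so misattributing a true line at one rest wavelength to a different line changes the inferred redshift:

```python
# A misidentified line leaves the observed wavelength unchanged, so
# wrong_wavelen * (1 + z_wrong) = true_wavelen * (1 + z_true).
OII, OIII = 3727.0, 5007.0  # rest-frame wavelengths in Angstroms

def confused_redshift(z_true, true_wavelen, wrong_wavelen):
    return true_wavelen * (1 + z_true) / wrong_wavelen - 1

# OII emission mistaken for OIII biases the inferred redshift low:
print(confused_redshift(0.5, OII, OIII))  # ~0.117
```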
This sample has photometric errors that we would like to convolve into the redshift posteriors, so that the posteriors are fully consistent with the errors. We can perform this convolution by sampling from the error distributions, calculating posteriors, and averaging.
FlowEngine
has this functionality already built in - we just have to provide err_samples
to the get_posterior
method.
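The idea can be sketched without RAIL at all. Below, a toy stand-in posterior (the Gaussian form and its magnitude dependence are invented for illustration) is convolved by averaging over draws from the photometric error distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0, 2.5, 100)

def toy_posterior(z_grid, mag):
    # hypothetical conditional p(z | mag), normalized on the grid
    p = np.exp(-0.5 * ((z_grid - 0.5 * (mag - 20.0)) / 0.1) ** 2)
    return p / (p.sum() * (z_grid[1] - z_grid[0]))

# sample magnitudes from the error distribution, then average posteriors
mags = rng.normal(24.0, 0.2, size=1000)
pdf_convolved = np.mean([toy_posterior(grid, m) for m in mags], axis=0)
```

Averaging over the error draws broadens the posterior relative to evaluating it at the measured magnitude alone.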
Let's calculate posteriors with a variable number of error samples.
grid = np.linspace(0, 2.5, 100)

def get_degr_post(key, data, **kwargs):
    flow_degr_post = FlowPosterior.make_stage(name=f'degr_post_{key}', **kwargs)
    return flow_degr_post.get_posterior(data, column='redshift')

degr_kwargs = dict(column='redshift', flow_file=flow_file,
                   marg_rules=dict(flag=np.nan, u=lambda row: np.linspace(25, 31, 10)),
                   grid=grid, seed=0, batch_size=2)
pdfs_errs_convolved = {
    err_samples: get_degr_post(f'{str(err_samples)}', data,
                               err_samples=err_samples, **degr_kwargs)
    for err_samples in [1, 10, 100, 1000]
}
Inserting handle into data store. output_degr_post_1: inprogress_output_degr_post_1.hdf5, degr_post_1 Warning. Failed to convert column 'ArrayImpl' object has no attribute 'data' Inserting handle into data store. output_degr_post_10: inprogress_output_degr_post_10.hdf5, degr_post_10 Warning. Failed to convert column 'ArrayImpl' object has no attribute 'data' Inserting handle into data store. output_degr_post_100: inprogress_output_degr_post_100.hdf5, degr_post_100 Warning. Failed to convert column 'ArrayImpl' object has no attribute 'data' Inserting handle into data store. output_degr_post_1000: inprogress_output_degr_post_1000.hdf5, degr_post_1000 Warning. Failed to convert column 'ArrayImpl' object has no attribute 'data'
fig, axes = plt.subplots(2, 3, dpi=120)
for i, ax in enumerate(axes.flatten()):
    # set dummy values for xlim
    xlim = [np.inf, -np.inf]
    for pdfs_ in pdfs_errs_convolved.values():
        # plot the pdf
        pdfs_.data[i].plot_native(axes=ax)
        # get the x value where the pdf first rises above 2
        xmin = grid[np.argmax(pdfs_.data[i].pdf(grid)[0] > 2)]
        if xmin < xlim[0]:
            xlim[0] = xmin
        # get the x value where the pdf finally falls below 2
        # (the -1 picks the last index above threshold, counted from the end)
        xmax = grid[-np.argmax(pdfs_.data[i].pdf(grid)[0, ::-1] > 2) - 1]
        if xmax > xlim[1]:
            xlim[1] = xmax
    # plot the true redshift
    #z_true = samples_degraded_wo_nondetects["redshift"][i]
    #ax.axvline(z_true, c="k", ls="--")
    # set x-label on bottom row
    if i >= 3:
        ax.set(xlabel="redshift")
    # set y-label on far left column
    if i % 3 == 0:
        ax.set(ylabel="p(z)")
    # set the x-limits so we can see more detail
    xlim[0] -= 0.2
    xlim[1] += 0.2
    ax.set(xlim=xlim, yticks=[])
# create the legend
axes[0, 1].plot([], [], c="C0", label="1 sample")
for i, n in enumerate([10, 100, 1000]):
    axes[0, 1].plot([], [], c=f"C{i+1}", label=f"{n} samples")
axes[0, 1].legend(
    bbox_to_anchor=(0.5, 1.3),
    loc="upper center",
    ncol=4,
)
plt.show()
You can see the effect of convolving the errors. In particular, notice that without error convolution (1 sample), the redshift posterior is often unrealistically narrow and inconsistent with the true redshift. As you convolve more samples, the posterior generally broadens and becomes consistent with the true redshift.
Also notice how the posterior continues to change as you convolve more and more samples. This suggests that you need to do a little testing to ensure that you have convolved enough samples.
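This is just the usual 1/√N behavior of Monte Carlo averages; a quick self-contained check (independent of any RAIL API) shows the scatter of an N-sample average shrinking accordingly:

```python
import numpy as np

# The scatter of an N-sample Monte Carlo average falls off like 1/sqrt(N),
# which is why convolved posteriors keep shifting until N is large enough.
rng = np.random.default_rng(0)
scatter = {}
for n in [100, 1000, 10000]:
    means = [rng.normal(0.0, 1.0, size=n).mean() for _ in range(200)]
    scatter[n] = float(np.std(means))
print(scatter)  # roughly {100: 0.1, 1000: 0.03, 10000: 0.01}
```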
Let's plot these same posteriors with even more samples to make sure they have converged:
WARNING: Running the next cell on your computer may exhaust your memory
pdfs_errs_convolved_more_samples = {
err_samples: get_degr_post(f'{str(err_samples)}', data,
err_samples=err_samples, **degr_kwargs)
# for err_samples in [1000, 2000, 5000, 10000]
for err_samples in [12]
}
Inserting handle into data store. output_degr_post_12: inprogress_output_degr_post_12.hdf5, degr_post_12 Warning. Failed to convert column 'ArrayImpl' object has no attribute 'data'
fig, axes = plt.subplots(2, 3, dpi=120)
for i, ax in enumerate(axes.flatten()):
    # set dummy values for xlim
    xlim = [np.inf, -np.inf]
    for pdfs_ in pdfs_errs_convolved_more_samples.values():
        # plot the pdf
        pdfs_.data[i].plot_native(axes=ax)
        # get the x value where the pdf first rises above 2
        xmin = grid[np.argmax(pdfs_.data[i].pdf(grid)[0] > 2)]
        if xmin < xlim[0]:
            xlim[0] = xmin
        # get the x value where the pdf finally falls below 2
        # (the -1 picks the last index above threshold, counted from the end)
        xmax = grid[-np.argmax(pdfs_.data[i].pdf(grid)[0, ::-1] > 2) - 1]
        if xmax > xlim[1]:
            xlim[1] = xmax
    # plot the true redshift
    #z_true = samples_degraded_wo_nondetects["redshift"][i]
    #ax.axvline(z_true, c="k", ls="--")
    # set x-label on bottom row
    if i >= 3:
        ax.set(xlabel="redshift")
    # set y-label on far left column
    if i % 3 == 0:
        ax.set(ylabel="p(z)")
    # set the x-limits so we can see more detail
    xlim[0] -= 0.2
    xlim[1] += 0.2
    ax.set(xlim=xlim, yticks=[])
# create the legend
for i, n in enumerate([1000, 2000, 5000, 10000]):
    axes[0, 1].plot([], [], c=f"C{i}", label=f"{n} samples")
axes[0, 1].legend(
    bbox_to_anchor=(0.5, 1.3),
    loc="upper center",
    ncol=4,
)
plt.show()
Notice that two of these galaxies may take upwards of 10,000 samples to converge (convolving over 10,000 samples takes about 0.5 seconds per galaxy on my computer).
3. Calculating posteriors with missing bands¶
Now let's finally tackle posterior calculation with missing bands.
First, let's make a sample that has missing bands. Let's use the same degrader as we used above, except without the final QuantityCut that removed non-detections:
samples_degraded = DS['output_lc_1p_0III_0II']
You can see that galaxy 3 has a non-detection in the u band. FlowEngine
can handle missing values by marginalizing over that value. By default, FlowEngine
will marginalize over NaNs in the u band, using the grid u = np.linspace(25, 31, 10)
. This default grid should work in most cases, but you may want to change the flag for non-detections, use a different grid for the u band, or marginalize over non-detections in other bands. In order to do these things, you must supply FlowEngine
with marginalization rules in the form of the marg_rules
dictionary.
Let's imagine we want to use a different grid for the u band marginalization. To determine what grid to use, we will create a histogram of non-detections in the u band vs. true u band magnitude (assuming year 10 LSST errors). This will tell us which values of u are reasonable to marginalize over.
# get true u band magnitudes
true_u = DS['output_degraded'].data["u"].to_numpy()
# get the observed u band magnitudes
obs_u = DS['output_lsst_errors'].data["u"].to_numpy()
# create the figure
fig, ax = plt.subplots(constrained_layout=True, dpi=100)
# plot the u band detections
ax.hist(true_u[~np.isnan(obs_u)], bins="fd", label="detected")
# plot the u band non-detections
ax.hist(true_u[np.isnan(obs_u)], bins="fd", label="non-detected")
ax.legend()
ax.set(xlabel="true u magnitude")
plt.show()
Based on this histogram, I will marginalize over u band values from 27 to 31. Just as I tested different numbers of error samples above, here I will test different resolutions for the u band grid.
I will provide our new u band grid in the marg_rules
dictionary, which will also include "flag"
which tells FlowEngine
what my flag for non-detections is.
In this simple example, we are using a fixed grid for the u band, but notice that the u band rule takes the form of a function - this is because the grid over which to marginalize can be a function of any of the other variables in the row.
If I wanted to marginalize over any other bands, I would need to include corresponding rules in marg_rules
too.
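Mechanically, marginalizing over a missing band just averages the conditional posterior over a grid of plausible values for that band. A toy version (the conditional density below is invented; it is not the pzflow flow):

```python
import numpy as np

grid = np.linspace(0, 2.5, 100)
u_grid = np.linspace(27, 31, 10)  # plausible u values for a non-detection

def toy_posterior(z_grid, u):
    # hypothetical conditional p(z | u), normalized on the grid
    p = np.exp(-0.5 * ((z_grid - 0.2 * (u - 26.0)) / 0.15) ** 2)
    return p / (p.sum() * (z_grid[1] - z_grid[0]))

# marginalize: average p(z | u) over the u grid (uniform weights)
pdf_marginalized = np.mean([toy_posterior(grid, u) for u in u_grid], axis=0)
```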
For this example, I will only calculate pdfs for galaxy 3, which is the galaxy with a non-detection in the u band. Also, similarly to how I tested the error convolution with a variable number of samples, I will test the marginalization with varying resolutions for the marginalized grid.
from rail.core.utilStages import RowSelector
# dict to save the marginalized posteriors
pdfs_u_marginalized = {}
row3_selector = RowSelector.make_stage(name='select_row3', start=3, stop=4)
row3_degraded = row3_selector(samples_degraded)
degr_post_kwargs = dict(grid=grid, err_samples=10000, seed=0, flow_file=flow_file, column='redshift')
# iterate over variable grid resolution
for nbins in [10, 20, 50, 100]:
    # set up the marginalization rules for this grid resolution
    marg_rules = {
        "flag": errorModel.config["ndFlag"],
        # bind nbins at definition time to avoid late-binding surprises
        "u": lambda row, nbins=nbins: np.linspace(27, 31, nbins),
    }
    # calculate the posterior by marginalizing over u and sampling
    # from the error distributions of the other galaxies
    pdfs_u_marginalized[nbins] = get_degr_post(f'degr_post_nbins_{nbins}',
                                               row3_degraded,
                                               marg_rules=marg_rules,
                                               **degr_post_kwargs)
Inserting handle into data store. output_select_row3: inprogress_output_select_row3.pq, select_row3 Inserting handle into data store. output_degr_post_degr_post_nbins_10: inprogress_output_degr_post_degr_post_nbins_10.hdf5, degr_post_degr_post_nbins_10 Warning. Failed to convert column 'ArrayImpl' object has no attribute 'data' Inserting handle into data store. output_degr_post_degr_post_nbins_20: inprogress_output_degr_post_degr_post_nbins_20.hdf5, degr_post_degr_post_nbins_20 Warning. Failed to convert column 'ArrayImpl' object has no attribute 'data' Inserting handle into data store. output_degr_post_degr_post_nbins_50: inprogress_output_degr_post_degr_post_nbins_50.hdf5, degr_post_degr_post_nbins_50 Warning. Failed to convert column 'ArrayImpl' object has no attribute 'data' Inserting handle into data store. output_degr_post_degr_post_nbins_100: inprogress_output_degr_post_degr_post_nbins_100.hdf5, degr_post_degr_post_nbins_100 Warning. Failed to convert column 'ArrayImpl' object has no attribute 'data'
fig, ax = plt.subplots(dpi=100)
for i in [10, 20, 50, 100]:
    pdfs_u_marginalized[i]()[0].plot_native(axes=ax, label=f"{i} bins")
ax.axvline(samples_degraded().iloc[3]["redshift"], label="True redshift", c="k")
ax.legend()
ax.set(xlabel="Redshift")
plt.show()
Notice that a resolution of only 10 bins is already sufficient for this marginalization.
In this example, only one of the bands featured a non-detection, but you can easily marginalize over more bands by including corresponding rules in the marg_rules
dict. For example, let's artificially force a non-detection in the y band as well:
row3_degraded()
|   | redshift | u | u_err | g | g_err | r | r_err | i | i_err | z | z_err | y | y_err |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 0.330247 | 22.541234 | 0.010564 | 21.875726 | 0.005297 | 21.150399 | 0.005084 | 20.944225 | 0.005121 | 20.658345 | 0.005214 | 20.648385 | 0.005929 |
sample_double_degraded = row3_degraded().copy()
sample_double_degraded.iloc[0, 11:] *= np.nan
DS.add_data('double_degr', sample_double_degraded, TableHandle)
sample_double_degraded
|   | redshift | u | u_err | g | g_err | r | r_err | i | i_err | z | z_err | y | y_err |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 0.330247 | 22.541234 | 0.010564 | 21.875726 | 0.005297 | 21.150399 | 0.005084 | 20.944225 | 0.005121 | 20.658345 | 0.005214 | NaN | NaN |
# set up the marginalization rules for u and y marginalization
marg_rules = {
    "flag": errorModel.config.ndFlag,
    "u": lambda row: np.linspace(27, 31, 10),
    "y": lambda row: np.linspace(21, 25, 10),
}
# calculate the posterior by marginalizing over u and y, and sampling
# from the error distributions of the other galaxies
#pdf_double_marginalized = get_degr_post('double_degr',
#                                        DS['double_degr'],
#                                        flow_file=flow_file,
#                                        input='xx',
#                                        column='redshift',
#                                        grid=grid,
#                                        err_samples=10000,
#                                        seed=0,
#                                        marg_rules=marg_rules)
#fig, ax = plt.subplots(dpi=100)
#pdf_double_marginalized.data[0].plot_native(axes=ax)
#ax.axvline(sample_double_degraded.iloc[0]["redshift"], label="True redshift", c="k")
#ax.legend()
#ax.set(xlabel="Redshift")
#plt.show()
Note that marginalizing over multiple bands quickly gets expensive.
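A back-of-the-envelope count shows why (the exact cost model is an assumption, but the multiplicative scaling is the point): every error sample has to be evaluated at every point of the Cartesian product of the marginalization grids.

```python
# cost multiplies: err_samples x (u grid) x (y grid) flow evaluations
err_samples = 10000
u_bins, y_bins = 10, 10
evaluations_per_galaxy = err_samples * u_bins * y_bins
print(evaluations_per_galaxy)  # 1000000
```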