Estimation Demo Notebooks

The NZDir Estimator

NZDir

The NZDir estimator¶

Author: Sam Schmidt
Last successfully run: June 30, 2022

This is a quick demo of the NZDir estimator, it has been ported to RAIL based on Joe Zuntz's implementation in TXPipe here: https://github.com/LSSTDESC/TXPipe/blob/nz-dir/txpipe/nz_calibration.py
First off, let's load the relevant packages from RAIL

In [1]:

import os
import rail
import numpy as np
import pandas as pd
import tables_io
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:

from rail.estimation.algos.NZDir import NZDir, Inform_NZDir
from rail.core.data import TableHandle
from rail.core.stage import RailStage

Next, let's set up the Data Store, so that our RAIL module will know where to fetch data:

In [3]:

DS = RailStage.data_store
DS.__class__.allow_overwrite = True

Next, we'll load some data into the Data Store:
test_dc2_training_9816.hdf5 contains ~10,000 galaxies from healpix 9816 of the cosmoDC2 "truth" catalog, and the "validation" data set contains ~20,000 galaxies from this same healpix pixel.

In [4]:

from rail.core.utils import RAILDIR
trainFile = os.path.join(RAILDIR, 'rail/examples/testdata/test_dc2_training_9816.hdf5')
testFile = os.path.join(RAILDIR, 'rail/examples/testdata/test_dc2_validation_9816.hdf5')
training_data = DS.read_file("training_data", TableHandle, trainFile)

Let's read test data in with tables_io, and then split it up into several tomographic bins. We can mock up some simple "tomographic" bins via their true redshift. The degrader expects a pandas DataFrame, so we will create three dataframes for each of a low, mid, and hi redshift sample. Let's also add a weight column to the test data while we are at it, this will be used later by the NZDir algorithm (for now we'll set all weights to one):

In [5]:

rawdata = tables_io.read(testFile)['photometry']
df = tables_io.convert(rawdata, tType=tables_io.types.PD_DATAFRAME)
szcol = rawdata['redshift']
numintest = len(szcol)
df['weight'] = np.ones(numintest, dtype='float')
lowmask = (szcol<=0.75)
midmask = np.logical_and(szcol>.75, szcol<1.25)
himask = (szcol>=1.25)
lowzdata = df[lowmask]
midzdata = df[midmask]
hizdata = df[himask]
low_bin = DS.add_data("lowz_bin", lowzdata, TableHandle)
mid_bin = DS.add_data("midz_bin", midzdata, TableHandle)
hi_bin = DS.add_data("hiz_bin", hizdata, TableHandle)

The algorithm:¶

The NZDir estimator tries to reconstruct the redshift distribution for an unknown sample (which we'll alternately call the "photometric sample", as it has photometric, but not spectroscopic information for each galaxy) by finding spectroscopic galaxies with similar magnitudes/colors and assigning a redshift based on those similarly-colored objects. In practice, this particular algorithm actually reverses that process, it defines a neighborhood around each spectroscopic object (based on the distance to the Nth nearest neighbor, where N is defined by the user via the parameter n_neigh). Then, it loops over the set of all spectroscopic objects and adds its (weighted) redshift to a histogram for each photometric object that it finds within the annulus. This process is more efficient computationally, and has the benefit of automatically "ignoring" photometric objects that have no similarly colored spectroscopic objects nearby. However, that could also be seen as a limitation, as if there are areas of color^N space not covered by your training sample, those galaxies will be "skipped" when assembling the tomographic redshift N(z) estimate, which can lead to biased results, as we will show later in this demo.

Like PDF estimators, the algorithm is broken up into an "inform" stage and an "estimate" stage. The inform stage creates the neighbors for the spectroscopic samples and calculates the distance to the Nth nearest neighbor that is used to determine annulus checks around each spec-z object. These quantites are stored in a specified model file that is loaded and used by the estimate stage.

Let's also add a "weight column" to the training data to test functionality. For simplicity we already set the weights to 1.0 for all photometric galaxies a few cells above, and now let's set weights of 0.5 for all spectroscopic galaxies. This should have no impact on the recovery compared to having no weights included. Note that if weights are not included, the algorithm will set all weights to 1.0. However, these weights could be used in more realistic analyses to reweight training or test samples to account for various biases.

In [6]:

numinphot = len(training_data()['photometry']['redshift'])
training_data()['photometry']['weight'] = np.ones(numinphot, dtype='float')*0.5

In [7]:

zmin = 0.0
zmax = 3.0
xmanybins = 25

Now, let's set up or estimator, first creating a stage for the informer. We define any input variables in a dictionary and then use that with make_stage to create an instance of our NZDir summarizer. We'll create a histogram of 25 bins, using 5 nearest neighbors to define our specz neighborhood, and above we defined our bin column as "bin":

In [8]:

train_nzdir = Inform_NZDir.make_stage(name='train_nzdir', n_neigh=5,
                                      szweightcol='weight', model="NZDir_model.pkl")

In [9]:

train_nzdir.inform(training_data)

Inserting handle into data store.  model_train_nzdir: inprogress_NZDir_model.pkl, train_nzdir

Out[9]:

<rail.core.data.ModelHandle at 0x7fa8284e4ac0>

Now, let's set up our NZDir estimator, providing parameters for the redshift grid, photomtetric weight column, and the model that we created with the informer. Note that NZDir returns N bootstrap samples rather than just a single distribution. The code draws bootstrap samples from the spectroscopic sample to use as input as the training data. nsamples can be used to set the number of bootstrap samples returned, for this demo we will only generate 20:

In [10]:

summdict = dict(leafsize=20, zmin=zmin, zmax=zmax, nzbins=xmanybins, nsamples=20,
                phot_weightcol='weight', model="NZDir_model.pkl", hdf5_groupname='')

We have three tomographic bins, we can make a stage and run each one in a loop. To run our Nz Estimator we just need to run estimate with arguments for the test and training data handles as named in the Data Store:
The code uses a fast Nearest Neighbor calculation and KDTree calculation, so this should run very fast:

In [11]:

%%time
bin_ens = {}
binnames = ['low', 'mid', 'hi']
bin_datasets = [low_bin, mid_bin, hi_bin]
for bin, indata in zip(binnames, bin_datasets):
    nzsumm = NZDir.make_stage(name=f'nzsumm_{bin}', **summdict)
    bin_ens[f'{bin}'] = nzsumm.estimate(indata)

Inserting handle into data store.  model: NZDir_model.pkl, nzsumm_low
Inserting handle into data store.  output_nzsumm_low: inprogress_output_nzsumm_low.hdf5, nzsumm_low
Inserting handle into data store.  single_NZ_nzsumm_low: inprogress_single_NZ_nzsumm_low.hdf5, nzsumm_low
Inserting handle into data store.  output_nzsumm_mid: inprogress_output_nzsumm_mid.hdf5, nzsumm_mid
Inserting handle into data store.  single_NZ_nzsumm_mid: inprogress_single_NZ_nzsumm_mid.hdf5, nzsumm_mid
Inserting handle into data store.  output_nzsumm_hi: inprogress_output_nzsumm_hi.hdf5, nzsumm_hi
Inserting handle into data store.  single_NZ_nzsumm_hi: inprogress_single_NZ_nzsumm_hi.hdf5, nzsumm_hi
CPU times: user 462 ms, sys: 0 ns, total: 462 ms
Wall time: 466 ms

indeed, for our 20,000 test and 10,000 training galaxies, it takes less than a second to run all three bins! Now, let's plot our estimates and compare to the true distributions in our tomo bins. While the ensembles actually contain 20 distributions, we will plot only the first bootstrap realization for each bin:

In [12]:

samebins = np.linspace(zmin,zmax, xmanybins)
binsize= samebins[1]-samebins[0]
bincents = 0.5*(samebins[1:] + samebins[:-1])

fig, axs = plt.subplots(1,3, figsize=(20,6))
bin_datasets = [low_bin, mid_bin, hi_bin]
binnames = ['low', 'mid', 'hi']
for ii, (bin, indata) in enumerate(zip(binnames, bin_datasets)):
    truehist, bins = np.histogram(indata()['redshift'], bins=samebins)
    norm = np.sum(truehist)*binsize
    truehist = np.array(truehist)/norm
    bin_ens[f'{bin}']().plot_native(axes=axs[ii],label="DIR estimate")
    axs[ii].bar(bincents, truehist,alpha=0.55, width=binsize, color='b', label="true redshift distn")
plt.legend(loc='upper right', fontsize=12)
plt.title("representative training data", fontsize=15)
plt.xlabel("redshift", fontsize=12)
plt.ylabel("N(z)", fontsize=12)

Out[12]:

Text(0, 0.5, 'N(z)')

Non-representative data¶

That looks very nice, while there is a little bit of "slosh" outside of each bin, we have a relatively compact and accurate representation from the DIR method! This makes sense, as our training and test data are drawn from the same underlying distribution (in this case cosmoDC2_v1.1.4). However, how will things look if we are missing chunks of data, or have incorrect redshifts in our spec-z sample? We can use RAIL's degradation modules to do just that: place incorrect redshifts for percentage of the training data, and we can make a magnitude cut that will limite the redshift and color range of our training data:

Let's import the necessary modules from rail.creation.degradation, we will put in "line confusion" for 5% of our sample, and then cut the sample at magnitude 23.5:

In [13]:

from rail.creation.degradation import LineConfusion, QuantityCut
from rail.core.data import PqHandle

In [14]:

line_confusion = LineConfusion.make_stage(name='line_confusion', hdf5_groupname='photometry',
                                          true_wavelen=5007., wrong_wavelen=3727., frac_wrong=0.05)

quantity_cut = QuantityCut.make_stage(name='quantity_cut', hdf5_groupname='photometry',
                                      cuts={'mag_i_lsst': 23.5})

The degrader expects a pandas dataframe, so let's construct one and add it to the data store, we'll strip out the 'photometry' hdf5 while we're at it:

In [15]:

degrade_df = pd.DataFrame(training_data.data['photometry'])
degrade_data = DS.add_data("degrade_data", degrade_df, PqHandle)

Now, apply our degraders:

In [16]:

train_data_conf = line_confusion(degrade_data)
train_data_cut = quantity_cut(train_data_conf)

Inserting handle into data store.  output_line_confusion: inprogress_output_line_confusion.pq, line_confusion
Inserting handle into data store.  output_quantity_cut: inprogress_output_quantity_cut.pq, quantity_cut

Let's plot our trimmed training sample, we see that we have fewer galaxies, so we'll be subject to more "shot noise"/discretization of the redshifts, and we are very incomplete at high redshift.

In [17]:

#compare original specz data to degraded data
fig = plt.figure(figsize=(10,6))
xbins = np.linspace(0,3,41)
plt.hist(training_data()['photometry']['redshift'],bins=xbins,alpha=0.75, label='original training data');
plt.hist(train_data_cut()['redshift'], bins=xbins,alpha=0.75, label='trimmed training data');
plt.legend(loc='upper right', fontsize=15)
plt.xlabel("redshift", fontsize=15)
plt.ylabel("N", fontsize=15)

Out[17]:

Text(0, 0.5, 'N')

Let's re-run our estimator on the same test data but now with our incomplete training data:

In [18]:

xinformdict = dict(n_neigh=5, bincol="bin", szweightcol='weight',
                   model="NZDir_model_incompl.pkl", hdf5_groupname='')
newsumm_inform = Inform_NZDir.make_stage(name='newsumm_inform', **xinformdict)

In [19]:

newsumm_inform.inform(train_data_cut)

Inserting handle into data store.  model_newsumm_inform: inprogress_NZDir_model_incompl.pkl, newsumm_inform

Out[19]:

<rail.core.data.ModelHandle at 0x7fa7df811c10>

Now we need to re-run our tomographic bin estimates with this new model:

In [20]:

%%time
xestimatedict = dict(leafsize=20, zmin=zmin, zmax=zmax, nzbins=xmanybins, hdf5_groupname='', nsamples=20,
                     phot_weightcol='weight', model=newsumm_inform.get_handle('model'))
new_ens = {}
binnames = ['low', 'mid', 'hi']
bin_datasets = [low_bin, mid_bin, hi_bin]
for bin, indata in zip(binnames, bin_datasets):
    nzsumm = NZDir.make_stage(name=f'nzsumm_{bin}', **xestimatedict)
    new_ens[f'{bin}'] = nzsumm.estimate(indata)

Inserting handle into data store.  output_nzsumm_low: inprogress_output_nzsumm_low.hdf5, nzsumm_low
Inserting handle into data store.  single_NZ_nzsumm_low: inprogress_single_NZ_nzsumm_low.hdf5, nzsumm_low
Inserting handle into data store.  output_nzsumm_mid: inprogress_output_nzsumm_mid.hdf5, nzsumm_mid
Inserting handle into data store.  single_NZ_nzsumm_mid: inprogress_single_NZ_nzsumm_mid.hdf5, nzsumm_mid
Inserting handle into data store.  output_nzsumm_hi: inprogress_output_nzsumm_hi.hdf5, nzsumm_hi
Inserting handle into data store.  single_NZ_nzsumm_hi: inprogress_single_NZ_nzsumm_hi.hdf5, nzsumm_hi
CPU times: user 129 ms, sys: 147 µs, total: 129 ms
Wall time: 130 ms

In [21]:

fig, axs = plt.subplots(1,3, figsize=(20,6))
samebins = np.linspace(0,3, xmanybins)
binsize= samebins[1]-samebins[0]
bincents = 0.5*(samebins[1:] + samebins[:-1])
bin_datasets = [low_bin, mid_bin, hi_bin]
binnames = ['low', 'mid', 'hi']

for ii, (bin, indata) in enumerate(zip(binnames, bin_datasets)):
    truehist, bins = np.histogram(indata.data['redshift'], bins=samebins)
    norm = np.sum(truehist)*binsize
    truehist = np.array(truehist)/norm
    new_ens[f'{bin}']().plot_native(axes=axs[ii],label="DIR estimate")
    axs[ii].bar(bincents, truehist,alpha=0.55, width=binsize, color='b', label="true redshift distn")
axs[0].legend(loc='upper right', fontsize=12)
axs[1].set_title("non-representative training data", fontsize=15)
axs[1].set_xlabel("redshift", fontsize=15)
axs[0].set_ylabel("N(z)", fontsize=15);

We see that the high redshift bin, where our training set was very incomplete, looks particularly bad, as expected. Bins 1 and 2 look surprisingly good, which is a promising sign that, even when a brighter magnitude cut is enforced, this method is sometimes still able to produce reasonable results.

In [ ]:

Testing Sampled Summarizers

test_sampled_summarizers.ipynb

test_sampled_summarizers

June 28 update: I modified the summarizers to output not just N sample N(z) distributions (saved to the file specified via the output keyword), but also the single fiducial N(z) estimate (saved to the file specified via the single_NZ keyword). I also updated NZDir and included it in this example notebook

In [1]:

import os
import rail
import numpy as np
import pandas as pd
import tables_io
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:

from rail.estimation.algos.knnpz import Inform_KNearNeighPDF, KNearNeighPDF

In [3]:

from rail.estimation.algos.varInference import VarInferenceStack
from rail.estimation.algos.naiveStack import NaiveStack
from rail.estimation.algos.pointEstimateHist import PointEstimateHist
from rail.estimation.algos.NZDir import Inform_NZDir, NZDir
from rail.core.data import TableHandle, QPHandle
from rail.core.stage import RailStage

In [4]:

import qp

In [5]:

DS = RailStage.data_store
DS.__class__.allow_overwrite = True

To create some N(z) distributions, we'll want some PDFs to work with first, for a quick demo let's just run some photo-z's using the KNearNeighPDF estimator using the 10,000 training galaxies to generate ~20,000 PDFs using data from healpix 9816 of cosmoDC2_v1.1.4 that are included in the RAIL repo:

In [6]:

knn_dict = dict(zmin=0.0, zmax=3.0, nzbins=301, trainfrac=0.75,
                sigma_grid_min=0.01, sigma_grid_max=0.07, ngrid_sigma=10,
                nneigh_min=3, nneigh_max=7, hdf5_groupname='photometry')

In [7]:

pz_train = Inform_KNearNeighPDF.make_stage(name='inform_KNN', model='demo_knn.pkl', **knn_dict)

In [8]:

# Load up the example healpix 9816 data and stick in the DataStore
from rail.core.utils import RAILDIR
trainFile = os.path.join(RAILDIR, 'rail/examples/testdata/test_dc2_training_9816.hdf5')
testFile = os.path.join(RAILDIR, 'rail/examples/testdata/test_dc2_validation_9816.hdf5')
training_data = DS.read_file("training_data", TableHandle, trainFile)
test_data = DS.read_file("test_data", TableHandle, testFile)

In [9]:

# train knnpz
pz_train.inform(training_data)

split into 7669 training and 2556 validation samples
finding best fit sigma and NNeigh...



best fit values are sigma=0.03 and numneigh=7



Inserting handle into data store.  model_inform_KNN: inprogress_demo_knn.pkl, inform_KNN

Out[9]:

<rail.core.data.ModelHandle at 0x7fc618ad1fa0>

In [10]:

pz = KNearNeighPDF.make_stage(name='KNN', hdf5_groupname='photometry',
                              model=pz_train.get_handle('model'))
qp_data = pz.estimate(test_data)

Process 0 running estimator on chunk 0 - 10000
Process 0 estimating PZ PDF for rows 0 - 10,000
Inserting handle into data store.  output_KNN: inprogress_output_KNN.hdf5, KNN
Process 0 running estimator on chunk 10000 - 20000
Process 0 estimating PZ PDF for rows 10,000 - 20,000
Process 0 running estimator on chunk 20000 - 20449
Process 0 estimating PZ PDF for rows 20,000 - 20,449

So, qp_data now contains the 20,000 PDFs from KNearNeighPDF, we can feed this in to three summarizers to generate an overall N(z) distribution. We won't bother with any tomographic selections for this demo, just the overall N(z). It is stored as qp_data, but has also been saved to disk as output_KNN.fits as an astropy table. If you want to read in this data to grab the qp Ensemble at a later stage, you can do this via qp with a ens = qp.read("output_KNN.fits")

I coded up quick and dirty bootstrap versions of the NaiveStack, PointEstimateHist, and VarInference sumarizers. These are not optimized, not parallel (issue created for future update), but they do produce N different bootstrap realizations of the overall N(z) which are returned as a qp Ensemble (Note: the previous versions of these degraders returned only the single overall N(z) rather than samples).

Naive Stack¶

Naive stack just "stacks" i.e. sums up, the PDFs and returns a qp.interp distribution with bins defined by np.linspace(zmin, zmax, nzbins), we will create a stack with 41 bins and generate 20 bootstrap realizations

In [11]:

stacker = NaiveStack.make_stage(zmin=0.0, zmax=3.0, nzbins=41, nsamples=20, output="Naive_samples.hdf5", single_NZ="NaiveStack_NZ.hdf5")

In [12]:

naive_results = stacker.summarize(qp_data)

Inserting handle into data store.  output: inprogress_Naive_samples.hdf5, NaiveStack
Inserting handle into data store.  single_NZ: inprogress_NaiveStack_NZ.hdf5, NaiveStack

The results are now in naive_results, but because of the DataStore, the actual ensemble is stored in .data, let's grab the ensemble and plot a few of the bootstrap sample N(z) estimates:

In [13]:

newens = naive_results.data

In [14]:

fig, axs = plt.subplots(figsize=(8,6))
for i in range(0, 20, 2):
    newens[i].plot_native(axes=axs, label=f"sample {i}")
axs.plot([0,3],[0,0],'k--')
axs.set_xlim(0,3)
axs.legend(loc='upper right')

Out[14]:

<matplotlib.legend.Legend at 0x7fc618a6e4c0>

The summarizer also outputs a second file containing the fiducial N(z). We saved the fiducial N(z) in the file "NaiveStack_NZ.hdf5", let's grab the N(z) estimate with qp and plot it with the native plotter:

In [15]:

naive_nz = qp.read("NaiveStack_NZ.hdf5")
naive_nz.plot_native(xlim=(0,3))

Out[15]:

<Axes: xlabel='redshift', ylabel='p(z)'>

Point Estimate Hist¶

PointEstimateHist takes the point estimate mode of each PDF and then histograms these, we'll again generate 41 bootstrap samples of this and plot a few of the resultant histograms. Note: For some reason the plotting on the histogram distribution in qp is a little wonky, it appears alpha is broken, so this plot is not the best:

In [16]:

pointy = PointEstimateHist.make_stage(zmin=0.0, zmax=3.0, nzbins=41, nsamples=20, single_NZ="point_NZ.hdf5", output="point_samples.hdf5")

In [17]:

%%time
pointy_results = pointy.summarize(qp_data)

Inserting handle into data store.  output: inprogress_point_samples.hdf5, PointEstimateHist
Inserting handle into data store.  single_NZ: inprogress_point_NZ.hdf5, PointEstimateHist
CPU times: user 35.8 ms, sys: 2.74 ms, total: 38.5 ms
Wall time: 38.2 ms

In [18]:

pens = pointy_results.data

In [19]:

fig, axs = plt.subplots(figsize=(8,6))
pens[0].plot_native(axes=axs, fc = [0, 0, 1, 0.01])
pens[1].plot_native(axes=axs, fc = [0, 1, 0, 0.01])
pens[4].plot_native(axes=axs, fc = [1, 0, 0, 0.01])
axs.set_xlim(0,3)
axs.legend()

Out[19]:

<matplotlib.legend.Legend at 0x7fc5dc871f10>

Again, we have saved the fiducial N(z) in a separate file, "point_NZ.hdf5", we could read that data in if we desired.

varInference¶

VarInference implements Markus' variational inference scheme and returns qp.interp gridded distribution. varInference tends to get a little wonky if you use too many bins, so we'll only use 25 bins. Again let's generate 20 samples and plot a few:

In [20]:

runner=VarInferenceStack.make_stage(name='test_varinf', zmin=0.0,zmax=3.0,nzbins=25, niter=10, nsamples=20,
                                    output="sampletest.hdf5", single_NZ="varinf_NZ.hdf5")

In [21]:

%%time
varinf_results = runner.summarize(qp_data)

Inserting handle into data store.  output_test_varinf: inprogress_sampletest.hdf5, test_varinf
Inserting handle into data store.  single_NZ_test_varinf: inprogress_varinf_NZ.hdf5, test_varinf
CPU times: user 1.94 s, sys: 127 ms, total: 2.07 s
Wall time: 2.07 s

In [22]:

vens = varinf_results.data
vens

Out[22]:

<qp.ensemble.Ensemble at 0x7fc5dc655190>

Let's plot the fiducial N(z) for this distribution:

In [23]:

varinf_nz = qp.read("varinf_NZ.hdf5")
varinf_nz.plot_native(xlim=(0,3))

Out[23]:

<Axes: xlabel='redshift', ylabel='p(z)'>

NZDir¶

NZDir is a different type of summarizer, taking a weighted set of neighbors to a set of training spectroscopic objects to reconstruct the redshift distribution of the photometric sample. I implemented a bootstrap of the spectroscopic data rather than the photometric data, both because it was much easier computationally, and I think that the spectroscopic variance is more important to take account of than simple bootstrap of the large photometric sample. We must first run the inform_NZDir stage to train up the K nearest neigh tree used by NZDir, then we will run NZDir to actually construct the N(z) estimate.

Like PointEstimateHist NZDir returns a qp.hist ensemble of samples

In [24]:

inf_nz = Inform_NZDir.make_stage(n_neigh=8, hdf5_groupname="photometry", model="nzdir_model.pkl")

In [25]:

inf_nz.inform(training_data)

Inserting handle into data store.  model: inprogress_nzdir_model.pkl, Inform_NZDir

Out[25]:

<rail.core.data.ModelHandle at 0x7fc618af8d60>

In [26]:

nzd = NZDir.make_stage(leafsize=20, zmin=0.0, zmax=3.0, nzbins=31, model="NZDir_model.pkl", hdf5_groupname='photometry',
                       output='NZDir_samples.hdf5', single_NZ='NZDir_NZ.hdf5')

In [27]:

nzd_res = nzd.estimate(test_data)

Inserting handle into data store.  output: inprogress_NZDir_samples.hdf5, NZDir
Inserting handle into data store.  single_NZ: inprogress_NZDir_NZ.hdf5, NZDir

In [28]:

nzd_ens = nzd_res.data

In [29]:

nzdir_nz = qp.read("NZDir_NZ.hdf5")

In [30]:

fig, axs = plt.subplots(figsize=(10,8))
nzd_ens[0].plot_native(axes=axs, fc = [0, 0, 1, 0.01])
nzd_ens[1].plot_native(axes=axs, fc = [0, 1, 0, 0.01])
nzd_ens[4].plot_native(axes=axs, fc = [1, 0, 0, 0.01])
axs.set_xlim(0,3)
axs.legend()

Out[30]:

<matplotlib.legend.Legend at 0x7fc5dc615820>

As we also wrote out the single estimate of N(z) we can read that data from the second file written (specified by the single_NZ argument given in NZDir.make_stage above, in this case "NZDir_NZ.hdf5")

In [31]:

nzdir_nz = qp.read("NZDir_NZ.hdf5")

In [32]:

nzdir_nz.plot_native(xlim=(0,3))

Out[32]:

<Axes: xlabel='redshift', ylabel='p(z)'>

Results¶

All three results files are qp distributions, NaiveStack and varInference return qp.interp distributions while pointEstimateHist returns a qp.histogram distribution. Even with the different distributions you can use qp functionality to do things like determine the means, modes, etc... of the distributions. You could then use the std dev of any of these to estimate a 1 sigma "shift", etc...

In [33]:

zgrid = np.linspace(0,3,41)
names = ['naive', 'point', 'varinf', 'nzdir']
enslist = [newens, pens, vens, nzd_ens]
results_dict = {}
for nm, en in zip(names, enslist):
    results_dict[f'{nm}_modes'] = en.mode(grid=zgrid).flatten()
    results_dict[f'{nm}_means'] = en.mean().flatten()
    results_dict[f'{nm}_std'] = en.std().flatten()

In [34]:

results_dict

Out[34]:

{'naive_modes': array([0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9,
        0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]),
 'naive_means': array([0.91149502, 0.90689545, 0.91136832, 0.90765168, 0.90561592,
        0.91330512, 0.91197592, 0.90824209, 0.91451169, 0.90922789,
        0.9087516 , 0.91218998, 0.91076003, 0.91391544, 0.90809329,
        0.90649634, 0.90949125, 0.91140398, 0.90799529, 0.91011015]),
 'naive_std': array([0.45672834, 0.45779718, 0.45799255, 0.45608257, 0.45772315,
        0.45806096, 0.45664706, 0.45644607, 0.45852141, 0.45562806,
        0.46094525, 0.45920772, 0.4539561 , 0.45695217, 0.45396763,
        0.45660786, 0.45550625, 0.45593187, 0.45400031, 0.456491  ]),
 'point_modes': array([0.9  , 0.9  , 0.9  , 0.9  , 0.9  , 0.9  , 0.9  , 0.9  , 0.9  ,
        0.9  , 0.9  , 0.9  , 0.9  , 0.9  , 0.9  , 0.9  , 0.9  , 0.975,
        0.9  , 0.9  ]),
 'point_means': array([0.90757435, 0.90355244, 0.90913087, 0.90599994, 0.90128744,
        0.91024727, 0.90898774, 0.90411422, 0.91178947, 0.90571726,
        0.90523062, 0.90943859, 0.90735965, 0.91075179, 0.90469031,
        0.90337353, 0.90669411, 0.90870864, 0.90467242, 0.90689091]),
 'point_std': array([0.44423985, 0.44567538, 0.4462743 , 0.44507997, 0.44419829,
        0.44549633, 0.4453005 , 0.44285937, 0.44654285, 0.44260597,
        0.44841522, 0.44692404, 0.44150332, 0.44463125, 0.44197287,
        0.44544701, 0.44275794, 0.44302682, 0.44123106, 0.44419575]),
 'varinf_modes': array([0.975, 0.975, 0.9  , 0.9  , 0.975, 0.975, 0.975, 0.9  , 0.9  ,
        0.9  , 0.9  , 0.975, 0.975, 0.975, 0.975, 0.9  , 0.975, 0.975,
        0.975, 0.975]),
 'varinf_means': array([0.88854022, 0.89158319, 0.89110634, 0.89315605, 0.89387354,
        0.89097531, 0.89294196, 0.89228821, 0.88712166, 0.88899743,
        0.88741712, 0.88722119, 0.89265368, 0.89404193, 0.89334961,
        0.89796394, 0.89413918, 0.88890824, 0.89510198, 0.88763958]),
 'varinf_std': array([0.41459107, 0.41830515, 0.41732654, 0.41270001, 0.41330109,
        0.41402322, 0.41774209, 0.41649308, 0.41708491, 0.41661354,
        0.41329181, 0.417675  , 0.41721839, 0.41584016, 0.41277957,
        0.41845365, 0.41651765, 0.41395451, 0.41459264, 0.4135    ]),
 'nzdir_modes': array([0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9,
        0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]),
 'nzdir_means': array([0.91798956, 0.92176052, 0.91434658, 0.92033863, 0.91362204,
        0.92948617, 0.92304575, 0.91872134, 0.91385976, 0.92238134,
        0.9252893 , 0.92597024, 0.92721059, 0.92450514, 0.92192798,
        0.91739566, 0.92603753, 0.92105897, 0.91872165, 0.9210996 ]),
 'nzdir_std': array([0.46447022, 0.46621283, 0.46657924, 0.4652038 , 0.46269032,
        0.46635063, 0.46901116, 0.46889646, 0.46817098, 0.46865234,
        0.46976555, 0.46632455, 0.46713524, 0.46255762, 0.4702946 ,
        0.46672201, 0.46883034, 0.46173737, 0.46291822, 0.46649785])}

You can also use qp to compute quantities the pdf, cdf, ppf, etc... on any grid that you want, much of the functionality of scipy.stats distributions have been inherited by qp ensembles

In [35]:

newgrid = np.linspace(0.005,2.995, 35)
naive_pdf = newens.pdf(newgrid)
point_cdf = pens.cdf(newgrid)
var_ppf = vens.ppf(newgrid)

In [36]:

plt.plot(newgrid, naive_pdf[0])

Out[36]:

[<matplotlib.lines.Line2D at 0x7fc5dc524490>]

In [37]:

plt.plot(newgrid, point_cdf[0])

Out[37]:

[<matplotlib.lines.Line2D at 0x7fc5dc45b6d0>]

In [38]:

plt.plot(newgrid, var_ppf[0])

Out[38]:

[<matplotlib.lines.Line2D at 0x7fc5dc391dc0>]

Shifts¶

If you want to "shift" a PDF, you can just evaluate the PDF on a shifted grid, for example to shift the PDF by +0.0375 in redshift you could evaluate on a shifted grid. For now we can just do this "by hand", we could easily implement shift functionality in qp, I think.

In [39]:

def_grid = np.linspace(0., 3., 41)
shift_grid = def_grid - 0.0675
native_nz = newens.pdf(def_grid)
shift_nz = newens.pdf(shift_grid)

In [40]:

fig=plt.figure(figsize=(12,10))
plt.plot(def_grid, native_nz[0], label="original")
plt.plot(def_grid, shift_nz[0], label="shifted +0.0675")
plt.legend(loc='upper right')

Out[40]:

<matplotlib.legend.Legend at 0x7fc5dc371f70>

You can estimate how much shift you might expect based on the statistics of our bootstrap samples, say the std dev of the means for the NZDir-derived distribution:

In [41]:

results_dict['nzdir_means']

Out[41]:

array([0.91798956, 0.92176052, 0.91434658, 0.92033863, 0.91362204,
       0.92948617, 0.92304575, 0.91872134, 0.91385976, 0.92238134,
       0.9252893 , 0.92597024, 0.92721059, 0.92450514, 0.92192798,
       0.91739566, 0.92603753, 0.92105897, 0.91872165, 0.9210996 ])

In [42]:

spread = np.std(results_dict['nzdir_means'])

In [43]:

spread

Out[43]:

0.0043607326241909454

Again, not a huge spread in predicted mean redshifts based solely on bootstraps, even with only ~20,000 galaxies.

In [ ]: