Overall completeness: 0%
Drell-Yan analysis Procedure
This twiki documents the most important steps of the Drell-Yan cross section measurement. It is intended to familiarize you with the technical aspects of the analysis procedure.
Step 1: Producing ntuples
- The CMSSW_53X MC samples are used for the 8 TeV analysis. Below is the list of starting GEN-SIM-RECO samples used in the muon and electron analyses:
We use SingleMu and DoubleMu Primary Datasets (PD), January2013 ReReco version
- JSONs: Cert_190456-208686_8TeV_22Jan2013ReReco_Collisions12_JSON.txt, Jan22Jan2013
- Double muon and double electron samples are used for the main analysis; single muon samples are used for the efficiency correction estimation steps. Other samples are used for background estimation.
- Relevant software: CMSSW_5_3_3_patch2
To perform a simple local test of the ntuple-maker, run:
To produce the ntuples over the full dataset, use CRAB:
Step 3: Event Selection
Once the ntuples are ready, one can proceed to the actual physics analysis. The first step of the analysis is the event selection. Currently, we use the so-called cut-based approach to discriminate between signal and background. For more on event selection read chapter 5 in the analysis note CMS-AN-13-420. Before starting to run a macro, set up the working area. Find all the necessary scripts in:
The code for event selection consists of 3 main files (and a few auxiliary ones). The first is the TSelector class, customized for the event selection used in a given analysis; the necessary weights (pileup, FEWZ and momentum scale correction) are applied in this macro. The Monte Carlo weights are also hardcoded inside the macro for each MC sample used. The second is the wrapper ROOT macro, which calls the TSelector to run on a given dataset. This wrapper is shown below and explained step by step:
- There is one extra level here: the Python script. It calls the ROOT wrapper macro above and typically looks like this:
Once this is understood, one can run the macro. To produce plots like 35-37 use the analyse.py macro, which calls the wrapper for TSelector for the DY analysis (as described above):
Important information about the reweightings: the pileup weight is read directly from an ntuple branch on a per-event basis. The FEWZ weights are extracted from a theoretical calculation and are provided as arrays inside the efficiencyWeightToBin2012.C file located in the same directory (or any other directory, as long as there is an appropriate include in the header of the TSelector). For signal MC only, the FEWZ weights are looked up based on the GEN mass as follows inside the code:
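As an illustration of such a lookup, here is a minimal Python sketch (not the code from the TSelector). The bin edges and weight values are made-up placeholders, and the names MASS_EDGES, FEWZ_WEIGHTS and fewz_weight are hypothetical; the real arrays are the ones in efficiencyWeightToBin2012.C.

```python
from bisect import bisect_right

# Placeholder mass bin edges (GeV) and weights, for illustration only.
# The real values come from efficiencyWeightToBin2012.C.
MASS_EDGES = [15.0, 20.0, 30.0, 60.0, 120.0, 1500.0]
FEWZ_WEIGHTS = [1.10, 1.05, 1.02, 1.00, 0.98]


def fewz_weight(gen_mass, is_signal_mc):
    """Return the weight of the mass bin containing gen_mass.

    Returns 1.0 for anything that is not signal MC and for masses
    outside the binned range.
    """
    if not is_signal_mc:
        return 1.0
    if gen_mass < MASS_EDGES[0] or gen_mass >= MASS_EDGES[-1]:
        return 1.0
    # bisect_right gives the index of the first edge above gen_mass
    return FEWZ_WEIGHTS[bisect_right(MASS_EDGES, gen_mass) - 1]
```

Each bin includes its lower edge and excludes its upper edge, which is an assumption of this sketch.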
Finally, the Rochester momentum scale correction recipe is described here: http://www-cdf.fnal.gov/~jyhan/cms_momscl/cms_rochcor_manual.html
A few words about the normalization. The data events are not renormalized. The MC events are weighted according to the probability of each event to be observed in a real collision and according to the number of events generated in the sample. Therefore
For better accuracy we use the number of events actually run over, rather than the number generated. We calculate it in the event loop and apply it in the EventSelector::Terminate() method. In both the 7 and 8 TeV analyses, we normalized the MC stack (signal and backgrounds) to the number of events in data in the Z peak region (before the efficiency corrections). A special post-processing macro takes care of this:
This Python script adds up the individual ROOT files with hadd and invokes the ROOT macros parser.C and parser_2D.C, which contain a method for normalizing the MC stack to data in the Z peak region.
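The two normalization steps can be sketched in Python as follows. This is an illustration under assumptions, not the content of parser.C: the per-sample weight is taken to be cross section times luminosity divided by the number of processed events, and the Z peak window of 60-120 GeV is a guess. The function names are hypothetical.

```python
def mc_event_weight(cross_section_pb, lumi_pb, n_processed):
    """Per-event MC weight, assuming weight = sigma * L / N_processed."""
    if n_processed <= 0:
        raise ValueError("number of processed events must be positive")
    return cross_section_pb * lumi_pb / n_processed


def z_peak_scale(data_yields, mc_yields, bin_edges, lo=60.0, hi=120.0):
    """Data/MC ratio summed over the mass bins fully inside [lo, hi].

    data_yields and mc_yields are per-bin yields; bin_edges has one
    more entry than the yield lists. The window limits are assumptions.
    """
    data_sum = 0.0
    mc_sum = 0.0
    for i, (data, mc) in enumerate(zip(data_yields, mc_yields)):
        if bin_edges[i] >= lo and bin_edges[i + 1] <= hi:
            data_sum += data
            mc_sum += mc
    if mc_sum == 0.0:
        raise ValueError("no MC events in the Z peak window")
    return data_sum / mc_sum
```

The resulting scale would then multiply every histogram in the MC stack, signal and backgrounds alike.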
After that, switch to the Dielectron working directory and produce the necessary yield histograms before continuing with the style plotting.
Inspect the wrapper_EE.sh file inside, set the do_selection flag to 1 (true), and check that the input files to run on are properly specified in the conf_file.
Then run in two steps: (1) produce reduced ntuples, (2) prepare binned yields for analysis
After that, the style macro is used to produce the publication-quality plots.
This plots the 1D yield distributions (the switch between electrons and muons is done manually inside the macro by adjusting the paths).
To plot the 2D distributions do:
Step 4: Acceptance and Efficiency estimation
Another constituent of the cross-section measurement is the acceptance-efficiency.
- Acceptance is determined using GEN level information
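For orientation, here is a minimal Python sketch of how the acceptance, the efficiency and their product relate per bin. It assumes the usual definitions (acceptance = events passing acceptance cuts / generated events, efficiency = events passing acceptance and selection / events passing acceptance); the function name is hypothetical and this is not the macro used in the analysis.

```python
def acceptance_efficiency(n_gen, n_acc, n_acc_sel):
    """Return (acceptance, efficiency, product) for one bin.

    n_gen:     generated events in the bin
    n_acc:     events passing the acceptance cuts
    n_acc_sel: events passing both acceptance and selection cuts
    """
    if n_gen <= 0 or n_acc <= 0:
        raise ValueError("bin has no generated or accepted events")
    acceptance = n_acc / n_gen
    efficiency = n_acc_sel / n_acc
    return acceptance, efficiency, acceptance * efficiency
```

With weighted events (pileup, FEWZ), the counts become sums of weights, but the ratios are formed the same way.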
To be able to produce the acceptance and efficiency one needs to change to a different folder, and run a different TSelector. But the general flow TSelector->ROOT wrapper->python wrapper is almost the same:
The script produces a ROOT file with histograms corresponding to the mass and rapidity spectra after the acceptance cuts, the selection cuts, or both. These are then used to calculate the acceptances, efficiencies and acceptance-efficiency products, with and without pileup and FEWZ reweighting, by executing:
To get the corresponding distributions in the electron channel change to XX
The macro outputs a ROOT file with a name starting with out1* or out2*, containing the histograms corresponding to the acceptance, the efficiency and their product. To produce the publication-level plots, use the style macro described in the previous section again.
To get the 2D plots do:
Step 5: Data-driven efficiency correction
This step is performed only in the muon channel; the electron efficiency scale factors are obtained from the EGamma group and are not re-measured independently.
Next, the data-driven efficiency corrections are applied. This is done using the standard CMSSW recipe, so a lot of additional packages need to be checked out. Follow this twiki: https://twiki.cern.ch/twiki/bin/viewauth/CMS/MuonTagAndProbe to set up your working area for the ntuple production (alternatively, one can use the trees already produced!)
- The procedure has two steps. The first is the T&P tree production, which is rerun seldom (ideally once) because it depends only on the definitions of the tag and probe
- If you haven't produced T&P trees, you can always use the official ntuples located as described in the MuonTagAndProbe twiki:
- The second step is fitting, with separate jobs for the trigger and for all the muon ID related efficiencies. It is rerun frequently and usually interactively (to change the binning or definitions)
After familiarizing yourself with the TagAndProbe package, you need to produce the muon efficiencies as a function of pT and eta. You can use the wrapper.py script specifying which variables to bin the efficiency in and what runs/MC samples to process.
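To make the use of the binned efficiencies concrete, here is a small Python sketch of building a data/MC scale factor map in (pT, eta) and looking it up for a muon. It assumes the correction is the ratio of the data efficiency to the MC efficiency in each bin; the function names and the fallback value of 1.0 outside the map are hypothetical choices, not taken from wrapper.py.

```python
from bisect import bisect_right


def scale_factor_map(eff_data, eff_mc):
    """Per-bin data/MC efficiency ratio; inputs are [pt_bin][eta_bin] grids."""
    return [[d / m for d, m in zip(row_d, row_m)]
            for row_d, row_m in zip(eff_data, eff_mc)]


def lookup_scale_factor(sf_map, pt, abs_eta, pt_edges, eta_edges):
    """Return the scale factor for a muon, or 1.0 if it is outside the map."""
    i = bisect_right(pt_edges, pt) - 1
    j = bisect_right(eta_edges, abs_eta) - 1
    if 0 <= i < len(pt_edges) - 1 and 0 <= j < len(eta_edges) - 1:
        return sf_map[i][j]
    return 1.0
```

For a dimuon event, one would typically multiply the factors of the two muons, but how the per-muon factors are combined in this analysis is not shown here.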
Finally, produce the plots with
Step 6: Background estimation
QCD data driven background estimation
In the 8 TeV analysis, the main method to estimate the QCD background in the dimuon channel is the ABCD method (the fake-rate method is used in the electron channel). Before starting, here is the ABCD method in a nutshell:
1) Choose two variables that are assumed to be independent
2) If there is no correlation, the ratio of yields should be the same: N_A / N_B = N_C / N_D
3) In our study, the two variables are the sign of the muon pair and the muon isolation
4) The QCD fraction depends on the region, so we produce a correction factor for each of the regions B, C and D
5) Obtain N_B, N_C and N_D from the data sample, and estimate N_A from them at the end (applying the correction factors)
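The arithmetic behind the steps above can be sketched in a few lines of Python. Solving N_A / N_B = N_C / N_D for N_A gives N_A = N_B * N_C / N_D. How the correction factors enter is an assumption here (each one simply multiplies the yield of its region), and the function name is hypothetical.

```python
def abcd_estimate(n_b, n_c, n_d, corr_b=1.0, corr_c=1.0, corr_d=1.0):
    """Estimate the QCD yield in signal region A from control regions B, C, D.

    The correction factors are assumed to scale the raw yields of
    their regions before the ratio is formed.
    """
    b = n_b * corr_b
    c = n_c * corr_c
    d = n_d * corr_d
    if d == 0:
        raise ValueError("region D is empty, the estimate is undefined")
    return b * c / d
```

For example, abcd_estimate(200, 30, 600) gives 10.0 events in region A.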
Now, let's go step by step.
First, change to the ABCD folder:
The procedure consists of a few steps and is guided by the wrapper.py script located inside the folder:
Thus, for each of the MC samples and for the real data, a sequence of scripts is run. First, the QCDFrac_*.py scripts invoke the EventSelector_Bkg.C TSelector class for various values of charge and isolation (the variables defining the signal and background regions); the coefficients are calculated from the filled histograms. Second, the qcdFracHadder.py script is run on the output of the first step. It is a utility script which repacks the histograms in an appropriate format. Third, the ABCD2vari_init.py script performs the actual evaluation of the ABCD coefficients in each region. Finally, the ABCD2vari_*.py scripts invoke the EventSelector_Bkg2.C TSelector class, passing the ABCD coefficients as TObjString objects inside the macro.
The post-processing and the output harvesting step is performed by the following python script:
It uses the output of the second TSelector as an input, hadds it and produces a ROOT file with the histogram which is then used in the analysis.
E-mu data-driven background estimation method
To estimate all the non-QCD backgrounds we employ the so-called e-mu data driven background estimation method. The same method is applied in the muon and electron channels. The code used for that purpose was originally adapted from Manny and it uses the so-called Bambu workflow. First, let's change into the e-mu working directory:
First, reduced ntuples are generated from the original Bambu ntuples:
One will have to edit data_emu.conf to point to the local ntuples before running. After running this step, the reduced ntuples should be output to a directory (../root_files/selected_events/DY/ntuples/EMU/). One would also need to run selectEvents.C to generate reduced electron ntuples. These ntuples must contain two branches, mass (dilepton invariant mass) and weight. After this is done, the e-mu macro can be run:
After this step, to produce a final ROOT file with histograms, one can run the following script:
Step 7: Unfolding
Unfolding is applied to correct for the migration of entries between bins caused by mass resolution effects (the FSR correction is taken into account as a separate step). For the Drell-Yan analysis, the chosen unfolding technique is matrix inversion. It provides a common interface between channels, for symmetry and for ease of combination and systematic studies.
Any unfolding with MC requires 3 things:
- Producing the response matrix
- Making the histogram of measured events
- Making the true histogram (clearly not used/available when unfolding data)
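The three ingredients above fit together as in the following pure-Python sketch of unfolding by matrix inversion. It is an illustration only, not the unfolding object used in the analysis. It assumes that the response matrix element R[i][j] is the probability for an event in true bin j to be measured in bin i, obtained by normalizing each column of the MC migration matrix, and it ignores events lost to inefficiency. The function names are hypothetical.

```python
def response_matrix(migration):
    """Normalize each column of migration[i][j] (measured bin i, true bin j)."""
    n = len(migration)
    col_sums = [sum(migration[i][j] for i in range(n)) for j in range(n)]
    return [[migration[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


def unfold(response, measured):
    """Solve response * true = measured by Gauss-Jordan elimination."""
    n = len(measured)
    # Augmented matrix [R | measured]
    aug = [[float(v) for v in response[i]] + [float(measured[i])]
           for i in range(n)]
    for col in range(n):
        # Partial pivoting for numerical stability
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("response matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n] for row in aug]
```

A closure test is the natural check: fold a known true histogram with the response matrix, unfold the result, and compare with the true histogram.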
First, as an exercise, one can use the script that demonstrates how the unfolding/fore-folding object works.
To get back the pulls:
The plots in the note are produced with the following:
1. To produce the response matrix:
2. To produce the unfolded yield plot do:
Checkpoint: with these macros one should be able to reproduce plots 49-50 from the note and Tables 17-18 (note that Table 18 uses the background yield result from the background section)
Step 8: FSR correction
The effect of FSR manifests itself as photon emission off the final state muon. It changes the dimuon invariant mass, and as a result the dimuon has an invariant mass distinct from the propagator (or Z/gamma*) mass.
For our analysis we estimate the effect of FSR and the corresponding correction bin by bin in invariant mass. This is done by comparing the pre-FSR and the post-FSR spectra. The pre-FSR spectrum is obtained by requiring the mother of the muon to be Z/gamma*; the post-FSR spectrum places no requirement on the mother. The corresponding plots in the note are 52-55; they can all be calculated with the information available in the ntuple using
To get the FSR histograms, one needs to turn the calculateFSR flag on.
Checkpoint: this macro will allow one to get plots 52-55 from the note
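A bin-by-bin correction of this kind can be sketched in Python as below. The direction of the ratio (pre-FSR over post-FSR, so that multiplying a post-FSR yield gives a pre-FSR yield) is an assumption of this sketch, as are the function names; the macro in the analysis may define it differently.

```python
def fsr_correction(pre_fsr, post_fsr):
    """Per-mass-bin ratio of the pre-FSR to the post-FSR MC spectrum.

    Bins with an empty post-FSR spectrum get a neutral factor of 1.0.
    """
    return [pre / post if post > 0 else 1.0
            for pre, post in zip(pre_fsr, post_fsr)]


def apply_fsr_correction(yields, correction):
    """Multiply each unfolded yield by the correction of its bin."""
    return [y * c for y, c in zip(yields, correction)]
```

Because the correction is applied per bin, it does not account for migration between bins, which is the usual limitation of bin-by-bin methods.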
Step 9: Systematic uncertainty estimation
There are various sources of systematics affecting our analysis: the PDF, theoretical modeling uncertainty, efficiency estimation uncertainty, background estimation, unfolding etc.
For the background estimation, with the data-driven method we estimate the systematic uncertainty as the difference between the result obtained with the method and that expected from MC per mass bin. The corresponding numbers are obtained with the emu_prediction_plots.py macro (see the recipe in the Step 6 section).
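As a minimal illustration of that prescription (not the content of emu_prediction_plots.py), the per-bin uncertainty can be written in Python as follows. The function name is hypothetical, and taking the absolute value of the difference is an assumption.

```python
def background_systematic(data_driven, mc_expected):
    """Absolute difference between data-driven and MC yields per mass bin."""
    if len(data_driven) != len(mc_expected):
        raise ValueError("inputs must have the same number of bins")
    return [abs(d - m) for d, m in zip(data_driven, mc_expected)]
```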
PDF uncertainty estimation. The recipe for the method currently used (step by step).
Reweight the PDF using the current existing MC samples as implemented in CMSSW. First, check out the necessary packages:
Then replace the LHAPDF library with the current up-to-date one, as described here:
or you can directly change in:
with above path:
then change the input file in PdfSystematicsAnalyzer.py and run:
With the up-to-date LHAPDF, one can use CT10, MSTW2008*, CTEQ66, NNPDF2.0, and other PDF sets.
Efficiency estimation uncertainty. The current method for efficiency estimation in the DY analysis is the following: we estimate the MC truth efficiency and then weight the MC events with the efficiency correction map (pT-eta) extracted using the data-driven tag-and-probe method applied to data and MC. The systematic uncertainty associated with the tag-and-probe efficiency estimation is due to the line-shape modelling, the difference between fitting and counting, and the binning. The first two are calculated inside the macros described in Step 5. The binning systematic uncertainty is estimated using the following macro:
It takes as input the ROOT files containing the histograms of the efficiency correction as a function of invariant mass in two binnings (to estimate the binning uncertainty); the other sources of uncertainty are also accessed.
Step 10: Electron-muon combination with the BLUE method
Having the ROOT files for the individual cross section measurements in the dielectron and dimuon channels, we need to combine them for a higher precision. The combination is performed with the BLUE method, which takes 2 vectors of measured values of the cross section and the covariance matrices.
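To show what BLUE does in the simplest case, here is a Python sketch combining two measurements of a single bin. It is a simplification of what resultCombiner.C handles: the real combination works on the full vectors and covariance matrices, including bin-to-bin correlations, which this sketch ignores. The function name is hypothetical. The weights minimize the variance of the combined value under the constraint that they sum to one.

```python
def blue_combine(x1, var1, x2, var2, cov12=0.0):
    """BLUE combination of two measurements of the same quantity.

    var1 and var2 are the variances of the two measurements and cov12
    is their covariance. Returns (combined value, combined variance).
    """
    denom = var1 + var2 - 2.0 * cov12
    if denom <= 0.0:
        raise ValueError("covariance matrix does not allow a combination")
    w1 = (var2 - cov12) / denom
    w2 = 1.0 - w1
    value = w1 * x1 + w2 * x2
    variance = (var1 * var2 - cov12 ** 2) / denom
    return value, variance
```

With uncorrelated measurements this reduces to the familiar inverse-variance weighted average.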
First, we need to make sure that the inputs are in the form the BLUE macro expects (i.e. ASCII, not ROOT):
We can use the txt2Plot.py macro to validate the txt input by visualizing it.
After we have the inputs in proper format, we just need to run the resultCombiner.C macro. To pass all the inputs properly (which should be in the current folder), we specify them in the wrapper.py script and run it as
The output is in ASCII format again, but we normally need it in ROOT format, so we have to run another converter after we have finished:
After that, we have a ROOT file with the cross section histogram in the same format as for the individual cross sections, and we can visualize it (produce a plot for the publication) in the same way as we did for the other cross sections in the previous section.
Step 11: Plotting the results
The main result of the measurement is the cross-section ratio, or r (and R) shape. We distinguish the R and r shapes (see chapter 9 of the note for details on the definitions, and also Figure 64). Figure 64 shows the shape R for theory and measurement (for two independent trigger scenarios). It relies on the theoretical cross-section measurement (1-2GeV bin), the final numbers for the acceptance correction and also the final numbers for the cross-section measurement. To give a clearer feeling of what this plot depends on, here are the tables that are used to produce the numbers in plot 64:
To run the code one simply needs:
Use Gautier style macros to get the same plots with different style:
To get all the up to date values for the shape r/R use:
One of the style requirements for the presented results is to put the measurement point at the weighted position (i.e. the location of the point inside the bin makes the integral over sub-bins equal on both sides). The following macro can be used to calculate these positions in ROOT:
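The idea can also be sketched outside ROOT. The Python function below finds the position inside a bin where the integral over the sub-bins to the left equals the integral to the right. It assumes the spectrum is flat inside each sub-bin and interpolates linearly there; the function name and the input format are hypothetical, not those of the ROOT macro.

```python
def weighted_position(sub_edges, sub_contents):
    """Return x such that the integral below x equals the integral above x.

    sub_edges:    edges of the sub-bins inside one measurement bin
    sub_contents: integral of the spectrum in each sub-bin
    """
    total = sum(sub_contents)
    if total <= 0:
        raise ValueError("bin is empty")
    half = total / 2.0
    running = 0.0
    for i, content in enumerate(sub_contents):
        if content > 0 and running + content >= half:
            # Fraction of this sub-bin needed to reach half of the total
            frac = (half - running) / content
            return sub_edges[i] + frac * (sub_edges[i + 1] - sub_edges[i])
        running += content
    return sub_edges[-1]
```

For a steeply falling spectrum the result sits below the bin centre, which is why the point is not simply drawn in the middle of the bin.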