Drell-Yan Analysis Procedure
This twiki documents the most important steps of the Drell-Yan cross section measurement. It is intended to familiarize you with the technical aspects of the analysis procedure.
Step 1: Producing ntuples
- The CMSSW_53X MC samples are used for the 8 TeV analysis. Below is the list of starting GEN-SIM-RECO samples used in the muon and electron analyses:
We use SingleMu and DoubleMu Primary Datasets (PD), January2013 ReReco version
- JSONs: Cert_190456-208686_8TeV_22Jan2013ReReco_Collisions12_JSON.txt, Jan22Jan2013
- Relevant software: CMSSW_5_3_3_patch2
To perform a quick local test of the ntuple-maker, run:
To produce the ntuples over the full dataset, use CRAB:
Step 3: Event Selection
Once the ntuples are ready, one can proceed to the actual physics analysis. The first step of every analysis is the event selection. Currently, we use the so-called cut-based approach to discriminate between signal and background. For more on event selection please read chapter 3 in the analysis note CMS-AN-11-013. Before starting to run a macro, set up the working area. Find all the necessary scripts in:
and follow the recipe below precisely, preserving the folder structure it recommends:
- Once you have created the chains (assuming you are in the ./rootfiles directory), do:
Before running the macros, we need to fix a few settings that change frequently in our analysis:
- Mass range and binning:
- for the early stage of the 2011 analysis we keep the 2010 binning [15,20,30,40,50,60,76,86,96,106,120,150,200,600]
- Trigger selection:
- See the presentation on event selection for 2011
- Thus, for 2011 we use a combination of double-muon triggers; a combination of single isolated-muon triggers can be used as a cross-check. Use the following three combinations:
- HLT_Mu15, HLT_Mu24, HLT_Mu30
- HLT_IsoMu15, HLT_IsoMu17, HLT_IsoMu24
- DoubleMu6, DoubleMu7, Mu13_Mu8
- Offline selection: The baseline event selection has not changed compared to the 2010 analysis, see
- we will consider moving to PF muons and PF isolation: this study is in progress right now
To produce the invariant mass plot, use the analyse2.C macro, which calls the TSelector for the DY analysis (called EventSelector):
The macro can run on multiple cores.
With minor changes inside the EventSelector one can calculate the efficiency-weighted invariant mass distribution (which is used to estimate the correction factor as a function of invariant mass). Inside EventSelector.C set
To produce dimuon kinematic distributions run
To produce the other control plots (for all the event selection variables used in the analysis, as documented in the note), use:
There are a few macros that help us optimize the cuts. These macros calculate the statistical significance and the uncertainty on the cross-section. The statistical significance is defined as:
S = N_sig/sqrt(N_sig+N_bkg) and is normally determined from MC. Since both yields are proportional to the luminosity, it scales as ~sqrt(Lumi). Other definitions of significance are sometimes used in analyses (see for instance CMS-TDR). To run, check out just two additional macros (assuming you did not leave the ./ControlPlots directory)
The first macro creates a txt file with the per-mass-bin values of signal and background. The second macro histograms the output. These macros are set up to optimize the acceptance cuts, but with minimal changes they can optimize any other cut we use, and the style can be changed to conform with the rest of the plots in the note.
Note: ROOT does not create output directories by itself, so you should create the corresponding directory for the output txt files, like:
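The significance definition above can be illustrated with a minimal Python sketch. It is not one of the optimization macros; the yields below are hypothetical numbers chosen only to show the formula and its sqrt(Lumi) scaling.

```python
import math


def significance(n_sig, n_bkg):
    """S = N_sig / sqrt(N_sig + N_bkg); returns 0.0 for an empty bin."""
    total = n_sig + n_bkg
    return n_sig / math.sqrt(total) if total > 0 else 0.0


# Hypothetical per-mass-bin MC yields (signal, background), illustration only.
yields = [(1200.0, 300.0), (400.0, 100.0), (16.0, 9.0)]
s_values = [significance(s, b) for s, b in yields]

# Scaling the luminosity by k scales both yields by k, hence S by sqrt(k).
k = 4.0
s_scaled = [significance(k * s, k * b) for s, b in yields]

for (s, b), s0, s1 in zip(yields, s_values, s_scaled):
    print(f"N_sig={s:7.1f} N_bkg={b:6.1f} S={s0:6.2f} S(4x lumi)={s1:6.2f}")
```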
Q1: Check data/MC agreement for each plot, look for discrepancies.
Checkpoint1 With the macros described above you should be able to *reproduce* the following plots from the CMS-AN-11-013: 1,3-14, 17-29,51.
Note: plots 23 and 25-29 have a different style and were produced with the PU sample.
Note: plots 20-22 are reproducible with the optimization macros but have a different style.
Step 4: Acceptance and Efficiency estimation
Another constituent of the cross-section measurement is the acceptance-efficiency.
- Acceptance is determined using GEN level information
How to run:
The script produces a ROOT file with histograms of the mass and rapidity spectra after the acceptance cuts, the selection cuts, or both. These are then used to calculate the acceptances, efficiencies and acceptance-efficiency products, with and without pileup and FEWZ reweighting, by executing:
The macro outputs a ROOT file whose name starts with out1* or out2*, containing the histograms of the acceptance, the efficiency and their product.
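As a rough illustration of what these histograms encode, here is a minimal Python sketch of the per-bin ratios. It assumes the usual definitions (acceptance = events passing acceptance cuts / generated events, efficiency = events passing acceptance and selection / events passing acceptance) and a simple binomial uncertainty; the event counts are hypothetical and the actual macro may differ in detail.

```python
import math


def ratio_with_error(n_pass, n_total):
    """Return (p, binomial error) for p = n_pass / n_total."""
    if n_total <= 0:
        return 0.0, 0.0
    p = n_pass / n_total
    return p, math.sqrt(p * (1.0 - p) / n_total)


# Hypothetical per-mass-bin MC event counts, illustration only.
n_gen = [10000, 8000, 2000]  # all generated events
n_acc = [4000, 4800, 1500]   # passing the acceptance cuts
n_sel = [3200, 4080, 1350]   # passing acceptance and selection cuts

results = []
for gen, acc, sel in zip(n_gen, n_acc, n_sel):
    acceptance, acc_err = ratio_with_error(acc, gen)
    efficiency, eff_err = ratio_with_error(sel, acc)
    product, prod_err = ratio_with_error(sel, gen)
    results.append((acceptance, efficiency, product))
    print(f"A={acceptance:.3f}+-{acc_err:.3f} "
          f"eff={efficiency:.3f}+-{eff_err:.3f} "
          f"A*eff={product:.3f}+-{prod_err:.3f}")
```

By construction the product computed directly from the counts equals acceptance times efficiency in each bin.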
Next, the data-driven efficiency corrections are applied. The details on the factorization and the application of correction factors are documented here, and can be found in this talk. With the current factorization scheme we measure the following four efficiencies:
- Trigger, Reconstruction+ID, isolation:
- We use the official TagAndProbe package
How to run (on top of CMSSW 425 or later):
- The procedure goes in two steps:
- T&P tree production -> rerun seldom (ideally once), it depends only on the definitions of the tag and probe
- If you haven't produced the T&P trees, you can always use the ready-made ones located here:
- fitting: separate job for trigger and all the muonID related efficiencies -> rerun frequently and usually interactively (change binning, definitions)
- All the latest macros/configs can be found here: UserCode/ASvyatkovskiy/TagAndProbe
- Isolation: RandomCone - the code is currently private and cannot be used.
After familiarizing yourself with the TagAndProbe package, you need to produce the muon efficiencies as a function of pT and eta. You do not need these in the analysis itself, but rather to check that everything you are doing is correct. After that, produce the 2D pT-eta efficiency map (it is already produced in one go when running fiMuonID.py). To do that, use the simple ROOT macros (adjust the i/o, not user friendly yet!):
And to produce 2D efficiency maps and correction factors do:
The final step here is to produce the efficiency as function of invariant mass and the efficiency correction factor as a function of invariant mass.
Note: you need to produce all the 2D correction maps in the two previous steps. If you have not succeeded, you can use the ones we used for the publication; the txt files are located here:
Checkpoint3 With the macros described in the step5 section it is possible to reproduce the following plots from the CMS-AN-11-013 note: 15-16, 39-42 and tables 11-12
Note: plot 40 was produced with the LKTC method, the code for which is currently not public and cannot be retrieved from the authors. Currently (2011 data) the result is consistent with that obtained with Tag-And-Probe.
Step 6: Background estimation
QCD data driven background estimation
There are various methods employed to estimate the QCD background in a data-driven way (QCD is currently the only background not estimated from MC). The most important are the template fit method and the weight map method: carefully read chapter 6 of the CMS-AN-11-013 for more details on the methods.
The weight map method has a few steps. First of all, create a pT-eta weight look-up table giving the probability for a muon to be isolated as a function of its pT and eta:
The next step is to view the map, and to test it on the sample of dimuons and single muons:
Other methods used for the QCD background estimation in the note are the SS/OS pair method and the template fit method (read their description in the note carefully!). The SS-OS method uses the discriminating power of the isolation variable, considering classes of events with 2, 1 or 0 isolated muons. You can get the plot by running:
As for the template fit method:
Note: the original input files can be found at:
Checkpoint: these macros will allow one to reproduce plots 45-48 as well as tables 13-15 from the note
We estimate the QCD background using the ABCD method in order to improve the systematic uncertainty on the background estimation. The ABCD method is very simple.
1) choose two variables and assume they are independent
2) if there is no correlation, the ratios should be the same: N_A / N_B = N_C / N_D
3) in our study, the two variables are the sign of the muon pair and the muon isolation
4) the QCD fraction differs from region to region, so we produce a correction factor for each region: B, C, D
5) obtain N_B, N_C, N_D from the data sample, and estimate N_A from them at the end (applying the correction factors)
QCDFrac.C: to produce correction factors for each region
ABCD2vari.C: to produce the ABCD results. The correction factors from the QCDFrac.C are plugged in this macro as an input.
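The arithmetic behind steps 2) and 5) can be sketched in a few lines of Python. This is not a transcription of ABCD2vari.C: the function name, the event counts, and in particular the assumption that the correction factors multiply the raw data counts in regions B, C and D are hypothetical and used only for illustration.

```python
def abcd_estimate(n_b, n_c, n_d, f_b=1.0, f_c=1.0, f_d=1.0):
    """Estimate N_A from N_A / N_B = N_C / N_D, i.e. N_A = N_B * N_C / N_D.

    f_b, f_c, f_d are per-region correction factors (assumed here to be
    multiplicative, e.g. the QCD fraction of the data in each region).
    """
    b, c, d = f_b * n_b, f_c * n_c, f_d * n_d
    if d <= 0:
        raise ValueError("region D is empty, N_A cannot be estimated")
    return b * c / d


# Hypothetical data counts in the control regions, illustration only.
n_a_raw = abcd_estimate(200.0, 50.0, 400.0)
n_a_corrected = abcd_estimate(200.0, 50.0, 400.0, f_b=0.9, f_c=0.8, f_d=1.0)
print(n_a_raw, n_a_corrected)
```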
ttbar data driven background estimation
We employ the so-called e-mu data driven background estimation method. See the following comprehensive talk for more details on the method. Currently the procedure to apply this method consists of 2 steps:
1) produce the root files with histograms
2) run the macros on the root files produced
For both steps one needs to check out the following tags:
The highlighted tags are important for step 2).
Following is the description of how to produce the root files.
The mother script file is Zprime2muAnalysis/test/DataMCSpectraComparison/histos.py
Instructions related to this script file are at
The short instruction is this:
or when you are ready
Wait for the root files to be produced. Currently it is configured to produce histograms with the selection marked 'VBTF', as we have in DY2011.
Below I describe step 2 in detail. Check out the additional macros, and copy them to your working directory:
Make sure the paths to datafiles inside the macros are pointing to the location of the root files you have produced. To produce the control plots for emu and mumu mass spectra use
To produce the correction factors run:
And finally, the MC expectation vs. data driven method prediction plots are produced with:
Good agreement between data and MC for both the mumu and emu spectra is necessary for the method to work reliably.
Step 7: Unfolding
Unfolding is applied to correct for the migration of entries between bins caused by mass resolution effects (the FSR correction is taken into account as a separate step). For the Drell-Yan analysis, the chosen unfolding method is matrix inversion. It provides a common interface between channels, for symmetry and ease of combination and systematic studies.
Any unfolding with MC requires three things:
- Producing the response matrix
- Making the histogram of measured events
- Making the true histogram (clearly not used/available when unfolding data)
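The three ingredients above can be put together in a minimal closure test, sketched here in plain Python. It is not the analysis unfolding code: the migration counts are hypothetical, every generated event is assumed to be reconstructed in some bin (no efficiency loss), and the inversion is done with a small Gauss-Jordan solver written for this example.

```python
def response_matrix(migration):
    """migration[i][j]: MC events generated in true bin j, reconstructed in bin i.

    Returns R with R[i][j] = P(reconstructed in i | generated in j).
    """
    n = len(migration)
    col_sums = [sum(migration[i][j] for i in range(n)) for j in range(n)]
    return [[migration[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("response matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Hypothetical migration counts for three mass bins, illustration only.
migration = [[900.0, 50.0, 0.0],
             [100.0, 900.0, 40.0],
             [0.0, 50.0, 360.0]]
n_bins = len(migration)
truth = [sum(migration[i][j] for i in range(n_bins)) for j in range(n_bins)]
measured = [sum(row) for row in migration]

response = response_matrix(migration)
unfolded = solve(response, measured)  # equivalent to R^-1 * measured
print("measured:", measured)
print("unfolded:", [round(u, 6) for u in unfolded])
print("truth:   ", truth)
```

Because the measured and true histograms come from the same MC sample as the response matrix, the unfolded yields should reproduce the true ones; with data only the measured histogram changes.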
First, one can do an exercise: use the script that demonstrates how the unfolding/fore-folding object works.
To get back the pulls:
The plots in the note are produced with the following:
1. To produce the response matrix:
2. To produce the unfolded yield plot do
Checkpoint7 With these macros one should be able to reproduce plots 49-50 from the note and Tables 17-18 (note that table 18 uses the background yield result from the background section)
Step 8: FSR correction
The effect of FSR is manifested by photon emission off the final state muon. It changes the dimuon invariant mass, so that the dimuon has an invariant mass distinct from the propagator (or Z/gamma*) mass.
For our analysis we estimate the effect of FSR and the corresponding correction bin by bin in invariant mass, by comparing the pre-FSR and post-FSR spectra. The pre-FSR spectrum is obtained by requiring the mother of the muon to be Z/gamma*; the post-FSR spectrum places no requirement on the mother. The corresponding plots in the note are 52-55; they can all be calculated with the information available in the ntuple using
To get the FSR histograms one needs to turn on the calculateFSR flag.
Checkpoint: this macro will allow one to get plots 52-55 from the note
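The bin-by-bin comparison of the two spectra can be sketched as follows. This is an illustration in Python, not the analysis macro: the spectra are hypothetical, and the direction of the ratio (pre-FSR over post-FSR, applied multiplicatively to a post-FSR yield) is an assumption of this sketch.

```python
def fsr_correction(pre_fsr, post_fsr):
    """Per-bin correction factor pre-FSR / post-FSR.

    Bins with an empty post-FSR spectrum get a factor of 1.0 (no correction).
    """
    return [pre / post if post > 0 else 1.0
            for pre, post in zip(pre_fsr, post_fsr)]


# Hypothetical MC spectra in three invariant mass bins, illustration only.
pre_fsr = [1000.0, 5000.0, 800.0]
post_fsr = [1100.0, 4800.0, 800.0]
factors = fsr_correction(pre_fsr, post_fsr)

# Apply the factors to a (hypothetical) measured post-FSR yield.
measured = [550.0, 2400.0, 400.0]
corrected = [m * f for m, f in zip(measured, factors)]
print(factors)
print(corrected)
```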
Step 9: Systematic uncertainty estimation
There are various sources of systematics affecting our analysis: the PDF, theoretical modeling uncertainty, efficiency estimation uncertainty, background estimation, unfolding etc.
For the background estimation with the data-driven method, we estimate the systematic uncertainty as the difference, per mass bin, between the result obtained with the method and that expected from MC. The corresponding numbers are obtained with the emu_prediction_plots.py macro (see the recipe in the Step 6 section).
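As a minimal sketch of this per-bin difference (not the emu_prediction_plots.py code), assuming the uncertainty is taken as the absolute difference between the two yields; the numbers are hypothetical.

```python
def background_systematic(data_driven, mc_expected):
    """Absolute per-bin difference between data-driven and MC background yields."""
    return [abs(dd - mc) for dd, mc in zip(data_driven, mc_expected)]


# Hypothetical background yields per mass bin, illustration only.
data_driven = [12.0, 30.0, 8.0]
mc_expected = [10.0, 33.0, 8.0]
syst = background_systematic(data_driven, mc_expected)

# Relative uncertainty with respect to the data-driven estimate.
relative = [s / dd if dd > 0 else 0.0 for s, dd in zip(syst, data_driven)]
print(syst, relative)
```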
PDF uncertainty estimation. The recipe for the method currently used (step by step).
Reweight the PDF using the current existing MC samples as implemented in CMSSW. First, check out the necessary packages:
then replace the LHAPDF library with the current up-to-date one, as described here:
or you can directly change in:
with above path:
then change the input file in PdfSystematicsAnalyzer.py and run:
With the up-to-date LHAPDF, one can use CT10, MSTW2008*, CTEQ66, NNPDF2.0, and other PDF sets.
Efficiency estimation uncertainty. The current method for efficiency estimation in the DY analysis is as follows: we estimate the MC truth efficiency and then weight the MC events with the efficiency correction map (pT-eta) extracted using the data-driven tag-and-probe method applied to data and MC. The systematic uncertainty associated with the tag-and-probe efficiency estimation comes from the line-shape modelling, the difference between fitting and counting, and the binning. The first two are calculated inside the macros described in Step5. The binning systematic uncertainty is estimated using the following macro:
It takes as input the ROOT files containing the histogram of the efficiency correction as a function of invariant mass in two binnings (to estimate the binning uncertainty); the other sources of uncertainty are also assessed.
Step 10: Plotting the results
The main result of the measurement is the cross-section ratio, or the r (and R) shape. We distinguish the R and r shapes (see chapter 9 of the note for the definitions, and also Figure 64). Figure 64 shows the shape R for theory and measurement (for two independent trigger scenarios). It relies on the theoretical cross-section measurement (1-2GeV bin), the final numbers for the acceptance correction and the final numbers for the cross-section measurement. To give a clearer feeling of what this plot depends on, I name the tables that are used to produce the numbers in plot 64:
To run the code one simply needs:
Use Gautier style macros to get the same plots with different style:
To get all the up to date values for the shape r/R use:
One of the style requirements for the presented results is to put the measurement point at the weighted position (i.e. the location of the point inside the bin that makes the integral over the sub-bins equal on both sides). The following macro can be used to calculate these positions; in ROOT do:
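Independently of that macro, the definition of the weighted position can be illustrated with a short Python sketch. It assumes the bin is divided into sub-bins with known contents and that the content is spread uniformly inside each sub-bin; the function name and the numbers are hypothetical.

```python
def weighted_position(edges, contents):
    """Position x inside [edges[0], edges[-1]] such that the integral of the
    sub-bin contents below x equals the integral above x.

    edges has len(contents) + 1 entries; linear interpolation is used
    inside the sub-bin where half of the total is reached.
    """
    total = sum(contents)
    if total <= 0:
        return 0.5 * (edges[0] + edges[-1])  # empty bin: fall back to the centre
    half = 0.5 * total
    running = 0.0
    for i, content in enumerate(contents):
        if content > 0 and running + content >= half:
            fraction = (half - running) / content
            return edges[i] + fraction * (edges[i + 1] - edges[i])
        running += content
    return edges[-1]


# Hypothetical falling spectrum inside a 120-150 mass bin, illustration only.
x = weighted_position([120.0, 130.0, 140.0, 150.0], [60.0, 30.0, 10.0])
print(x)
```

For a steeply falling spectrum the point moves towards the lower edge of the bin, while for a flat spectrum it coincides with the bin centre.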