Overall completeness: 0%

Drell-Yan Analysis Procedure

This twiki documents the most important steps of the Drell-Yan cross section measurement. It is intended to familiarize you with the technical aspects of the analysis procedure. 

Step 1: Producing ntuples

Completeness: 95%

Dataset & Generator & Tune

DYToMuMuM-10To20 & Powheg-Pythia6 & CT10TuneZ2star
DYToMuMuM-20 & Powheg-Pythia6 & CT10TuneZ2star
DYToMuMuM-200 & Powheg-Pythia6 & TuneZ2star
DYToMuMuM-400 & Powheg-Pythia6 & TuneZ2star
DYToMuMuM-500 & Powheg-Pythia6 & TuneZ2star
DYToMuMuM-700 & Powheg-Pythia6 & TuneZ2star
DYToMuMuM-800 & Powheg-Pythia6 & TuneZ2star
DYToMuMuM-1000 & Powheg-Pythia6 & TuneZ2star
DYToMuMuM-1500 & Powheg-Pythia6 & TuneZ2star
DYToMuMuM-2000 & Powheg-Pythia6 & TuneZ2star
DYToEEM-10To20 & Powheg-Pythia6 & CT10TuneZ2star
DYToEEM-20 & Powheg-Pythia6 & CT10TuneZ2star
DYToEEM-200 & Powheg-Pythia6 & TuneZ2star
DYToEEM-400 & Powheg-Pythia6 & TuneZ2star
DYToEEM-500 & Powheg-Pythia6 & TuneZ2star
DYToEEM-700 & Powheg-Pythia6 & TuneZ2star
DYToEEM-800 & Powheg-Pythia6 & TuneZ2star
DYToEEM-1000 & Powheg-Pythia6 & TuneZ2star
DYToEEM-1500 & Powheg-Pythia6 & TuneZ2star
DYToEEM-2000 & Powheg-Pythia6 & TuneZ2star
DYToTauTauM-10To20 & Powheg-Pythia6-tauola & TuneZ2star
DYToTauTauM-20 & Powheg-Pythia6-tauola & CT10TuneZ2star
WJetsToLNu & madgraph-tarball & TuneZ2star
WWJetsTo2L2Nu & madgraph-tauola & TuneZ2star
WZJetsTo2L2Q & madgraph-tauola & TuneZ2star
WZJetsTo3LNu & madgraph-tauola & TuneZ2star
ZZJetsTo2L2Nu & madgraph-tauola & TuneZ2star
ZZJetsTo2L2Q & madgraph-tauola & TuneZ2star
ZZJetsTo4L & madgraph-tauola & TuneZ2star
TTMtt-700to1000 & Powheg-tauola & TuneZ2star
TTMtt-1000toInf & Powheg-tauola & TuneZ2star
TTJetsFullLeptMGDecays & madgraph & TuneZ2star
TT & Powheg-tauola & TuneZ2star
TW & Powheg-tauola & TuneZ2star
TbarW & Powheg-tauola & TuneZ2star
QCDPt-15to20MuPt5Enriched & Pythia6 & TuneZ2star
QCDPt-20to30MuPt5Enriched & Pythia6 & TuneZ2star
QCDPt-30to50MuPt5Enriched & Pythia6 & TuneZ2star
QCDPt-50to80MuPt5Enriched & Pythia6 & TuneZ2star
QCDPt-80to120MuPt5Enriched & Pythia6 & TuneZ2star
QCDPt-120to150MuPt5Enriched & Pythia6 & TuneZ2star
QCDPt-150MuPt5Enriched & Pythia6 & TuneZ2star

The MC generation release cycle is 53X, so set up a matching CMSSW area:

cmsrel CMSSW_5_3_3_patch2
cd CMSSW_5_3_3_patch2/src
git cms-addpkg DataFormats/PatCandidates
git cms-addpkg PhysicsTools/PatAlgos
git cms-addpkg PhysicsTools/PatUtils
git clone git@github.com:ASvyatkovskiy/DYAnalysis DimuonAnalysis/DYPackage
scram b -j8
export DYWorkDir=$PWD/DimuonAnalysis/DYPackage
cd $DYWorkDir/ntuples

To perform a simple local test of the ntuple maker, run:

cmsRun ntuple_cfg.py

To produce the ntuples over the full dataset, use CRAB:

crab -create -submit -cfg crab.cfg
crab -get all -c <crab_0_datetime>

Step 3: Event Selection

Completeness: 60%

Once the ntuples are ready, one can proceed to the actual physics analysis. The first step of the analysis is the event selection. Currently, we use the so-called cut-based approach to discriminate between signal and background. For more on event selection read chapter 5 in the analysis note CMS-AN-13-420. Before starting to run a macro, set up the working area. Find all the necessary scripts in:

cd $DYWorkDir/test/ControlPlots

The code for event selection consists of three main files (plus a few auxiliary ones). First is the TSelector class, customized for the event selection used in a given analysis; the necessary weights (pileup, FEWZ and momentum scale corrections) are applied in this macro. The Monte Carlo weights are also hardcoded inside the macro for each MC sample used. Next is the wrapper ROOT macro, which calls the TSelector to run on a given dataset. This wrapper is shown below and explained step by step:

//The macro takes 3 arguments, passed from the python script: the histogram name (invariant mass or, for instance, rapidity), the ntuple/custom weight flag (deprecated - we always use the custom weight), and the type of momentum scale correction (also deprecated - the correction does not depend on the run range in the 8 TeV analysis)
void analyseYield(const char* WHICHHIST, const char* NTUPLEWEIGHT, const char* MOMCORRTYPE) {

  // Depending on the directory with data, the protocol used to access data will be different: "file" or "xrootd" are the most commonly used.
  TString protocol = "file://";
  //TString protocol = "root://xrootd.rcac.purdue.edu/";

  //Pointer to the location of the data used. Can be on /mnt/hadoop or on the scratch
  TString dirname = "/mnt/hadoop/store/group/ewk/DY2013/";

  // Next, the TFileCollection is created. This section is specific for each dataset: data or MC, so we prepare this wrapper macro for each sample
  TFileCollection* c1 = new TFileCollection("data","data");
  //Splitting by runs/eras happens here: switch between RunAB, RunC, RunD. This is handy for studies of run dependencies.
  //Note: compare via TString (or strcmp), not ==, since MOMCORRTYPE is a const char* and == would compare pointers
  TString era(MOMCORRTYPE);
  if (era == "RunAB") c1->Add(protocol+dirname+"Data_RunAJan2013_Oct"+"/*.root");
  if (era == "RunAB") c1->Add(protocol+dirname+"Data_RunBJan2013_Oct_p1"+"/*.root");
  if (era == "RunAB") c1->Add(protocol+dirname+"Data_RunBJan2013_Oct_p2"+"/*.root");
  if (era == "RunC1") c1->Add(protocol+dirname+"Data_RunCJan2013_Oct_p1"+"/*.root");
  if (era == "RunC2") c1->Add(protocol+dirname+"Data_RunCJan2013_Oct_p2"+"/*.root");
  if (era == "RunD1") c1->Add(protocol+dirname+"Data_RunDJan2013_Oct_p1"+"/*.root");
  if (era == "RunD2") c1->Add(protocol+dirname+"Data_RunDJan2013_Oct_p2"+"/*.root");

  //Set the location of ProofLite Sandbox. It is more convenient to use the custom path rather than $HOME/.proof
  gEnv->SetValue("ProofLite.Sandbox", "<path to your working dir>/test/ControlPlots/proofbox/");
  //splitting criteria: how many worker nodes to use for the run: using more than 10-15 nodes usually will cause instability and lead to a crash subsequently
  TProof* p = TProof::Open("workers=20"); 
  p->RegisterDataSet("DATA", c1,"OV");
  //Deprecated - just leave as is, always
  TObjString* useNtupleWeightFlag = new TObjString(NTUPLEWEIGHT);
  p->AddInput(new TNamed("useNtupleWeightFlag",NTUPLEWEIGHT));

  //The histogram name should normally be "invm" - it produces both the 1D and 2D histograms. For N-1 selection studies, pass the name of the cut to exclude instead
  TObjString* histogramThis = new TObjString(WHICHHIST);
  p->AddInput(new TNamed("histogramThis",WHICHHIST));
  //This is now useless, but for later studies it might become useful again, if there is a run dependency for the momentum scale correction
  TObjString* momCorrType = new TObjString(MOMCORRTYPE);
  p->AddInput(new TNamed("momCorrType",MOMCORRTYPE));

  p->SetParameter("PROOF_LookupOpt", "all");
  //This invokes the TSelector: "recoTree/DiMuonTree" is the name of the ROOT tree inside the file, "EventSelector_CP.C" is the name of the TSelector
  p->Process("DATA#/recoTree/DiMuonTree", "EventSelector_CP.C+");
}

Finally, the Python wrapper script loops over run eras and histogram names and invokes the ROOT macro:
#!/usr/bin/env python
from subprocess import Popen

#This is normally just "invm", but for N-1 control plots like 18-25 in AN-13-420 set it to a custom cut name, e.g. one of: 'relPFisoNoEGamma','chi2dof','trackerHits','pixelHits','CosAngle','muonHits','nMatches','dxyBS','vtxTrkProb','trigMatches','pT','eta'
histos = ['invm'] 
#normally one needs to run over all of them. Splitting to a set of runs is useful because loading very large number of files into one session can cause instability
eras = ['RunAB','RunC1','RunC2','RunD1','RunD2'] 
#Simply invoke ROOT wrapper macro using Popen
for run in eras:
    for hist in histos:
        Popen('root -b -l -q \'analyseYield.C(\"'+hist+'\",\"False\",\"'+run+'\")\'',shell=True).wait()

Once this is understood, one can run the macro. To produce plots like 35-37 use the analyse.py macro, which calls the wrapper for TSelector for the DY analysis (as described above):

mkdir runfolder
python analyseYield_mc.py
python analyseYield_data.py

Important information about the reweightings. Pileup reweighting is accessed from the ntuple, directly from a branch, on a per-event basis. The FEWZ weights are extracted from the theoretical calculation and are provided as arrays inside the efficiencyWeightToBin2012.C file located in the same directory (or any other directory, as long as there is an appropriate include in the header of the TSelector). Inside the code, the FEWZ weight is looked up based on the GEN mass, for signal MC only:

//look up FEWZ weight

FEWZ_WEIGHT = weight(genDiMuPt, fabs(genRapidity), genMass, true);
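Conceptually, the lookup is a binned-table search on the GEN-level mass. A minimal Python sketch of the idea, with hypothetical bin edges and weights (the real arrays are hardcoded in efficiencyWeightToBin2012.C and also depend on pT and rapidity):

```python
import bisect

# Hypothetical mass bin edges (GeV) and per-bin FEWZ weights;
# the real arrays live in efficiencyWeightToBin2012.C.
MASS_EDGES = [15, 20, 30, 45, 60, 120, 200, 1500]
FEWZ_WEIGHTS = [1.04, 1.02, 0.99, 0.98, 1.00, 1.01, 1.03]

def fewz_weight(gen_mass):
    """Return the FEWZ weight of the bin containing gen_mass (1.0 if out of range)."""
    if gen_mass < MASS_EDGES[0] or gen_mass >= MASS_EDGES[-1]:
        return 1.0
    ibin = bisect.bisect_right(MASS_EDGES, gen_mass) - 1
    return FEWZ_WEIGHTS[ibin]
```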

Finally, the Rochester momentum scale correction recipe is described here: http://www-cdf.fnal.gov/~jyhan/cms_momscl/cms_rochcor_manual.html

A few words about the normalization. The data events are not renormalized. The MC events are weighted according to the probability of each event to be observed in a real collision and according to the number of events generated in the sample. Therefore:

Event_weight ~ (Cross section x filter efficiency)/(Number of generated events)
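A hedged numerical illustration of this weight, normalizing a sample to a data luminosity (all numbers hypothetical):

```python
def mc_event_weight(cross_section_pb, filter_eff, n_processed, lumi_inv_pb):
    """Per-event MC weight ~ (cross section x filter efficiency) / N(events),
    scaled by the integrated luminosity so the sample matches the data."""
    return cross_section_pb * filter_eff * lumi_inv_pb / n_processed

# Hypothetical sample: 1966.7 pb cross section, no filter, 10M events, 19.7/fb
w = mc_event_weight(1966.7, 1.0, 10_000_000, 19700.0)
```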

For better accuracy we use the number of events actually run over, rather than the number generated; it is counted in the event loop and applied in the EventSelector::Terminate() method. In both the 7 and 8 TeV analyses, we normalized the MC stack (signal and backgrounds) to the number of events in data in the Z peak region (before the efficiency corrections). A special post-processing macro takes care of this:

python postprocessor.py
cp runfolder/stack* ../Inputs/rawYield

This Python script adds up the individual ROOT files with hadd and invokes the ROOT macros parser.C and parser_2D.C, which normalize the MC stack to data in the Z peak region.
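The Z-peak normalization amounts to a single scale factor applied to every MC component. A sketch with hypothetical yields (the actual implementation lives in parser.C):

```python
def zpeak_scale(data_peak_yield, mc_peak_yields):
    """Scale factor normalizing the full MC stack to data in the Z peak region."""
    return data_peak_yield / sum(mc_peak_yields)

# Hypothetical Z-peak yields for the signal and background components
mc = {'DY': 4.2e6, 'ttbar': 1.1e4, 'QCD': 3.0e3}
scale = zpeak_scale(4.3e6, mc.values())
scaled = {name: yield_ * scale for name, yield_ in mc.items()}
```

After scaling, the MC stack integrates to the data yield in the peak window by construction.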

After that, switch to the Dielectron working directory and produce the necessary yield histograms before continuing with the style plotting.

Next, the style macro is used to produce the publication-quality plots.

cd ../style/DY
root -l plot.C

This plots the 1D yield distributions (the switch between electrons and muons is done manually inside the macro by adjusting the paths).

To plot the 2D distributions do:

root -l ControlPlots_2D.C

Step 4: Acceptance and Efficiency estimation

Completeness: 100%

Another ingredient of the cross-section measurement is the acceptance-efficiency product.

To produce the acceptance and efficiency one needs to change to a different folder and run a different TSelector, but the general flow (TSelector -> ROOT wrapper -> Python wrapper) is almost the same:

cd $DYWorkDir/AccEffMCtruth
python analyseMCtruth.py

The script produces a ROOT file with histograms of the mass and rapidity spectra after the acceptance cuts, the selection cuts, or both. These are then used to calculate the acceptances, efficiencies and acceptance-efficiency products, with and without pileup and FEWZ reweighting, by executing:

root -l plotMCtruth.C
root -l plotMCtruth_2D.C
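The MC-truth definitions behind these histograms can be summarized in a short sketch (all event counts hypothetical):

```python
def acc_eff(n_gen, n_in_acc, n_sel):
    """MC-truth acceptance and efficiency per mass (or rapidity) bin:
    acceptance = events inside the fiducial (pT, eta) cuts / all generated events,
    efficiency = events passing the full selection / events inside the acceptance."""
    acc = n_in_acc / n_gen
    eff = n_sel / n_in_acc
    return acc, eff, acc * eff

# Hypothetical counts for one bin
acc, eff, acceff = acc_eff(n_gen=100000, n_in_acc=45000, n_sel=38000)
```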

To get the corresponding distributions in the electron channel change to XX



The macros output ROOT files starting with out1* or out2* that contain the histograms for the acceptance, the efficiency and their product. To produce the publication-level plots, the style macro described in the previous section is used again:

cd ../style/DY
root -l plot.C

To get the 2D plots do:

root -l plot_acc_2D.C

Step 5: Data-driven efficiency correction

Completeness: 15%

Next, the data-driven efficiency corrections are applied. This is done using the standard CMSSW recipe, so a number of additional packages need to be checked out. Follow this twiki: https://twiki.cern.ch/twiki/bin/viewauth/CMS/MuonTagAndProbe to set up your working area for the ntuple production (alternatively, one can use the trees already produced!)

cd TagAndProbe
cmsRun tp_from_aod_Data_newofficial.py


cmsRun fitMuonID_data_all_2011.py

After familiarizing yourself with the TagAndProbe package, produce the muon efficiencies as functions of pT and eta. These are not used in the analysis directly, but serve to check that everything you are doing is correct. After that, produce the 2D pT-eta efficiency map (it is already produced in one go when running the fit configuration above). To do that, use the simple ROOT macros (adjust the I/O paths; not user friendly yet!):

root -l idDataMC_4xy.C
root -l triggerMuonDataMC_4xy.C

And to produce 2D efficiency maps and correction factors do:

 root -l perBinTable.C

The final step here is to produce the efficiency and the efficiency correction factor as functions of the invariant mass:

cvs co UserCode/Purdue/DYAnalysis/AnalysisMacros/Correction
root -l efficiencyMass_newTmp.C
root -l correctionMass_newTmp.C

Note: you need to produce all the 2D correction maps in the two previous steps; if you have not succeeded, you can use what we used for the publication - the txt files are located here:


Checkpoint: with the macros described in the Step 5 section it is possible to reproduce the following plots from the CMS-AN-11-013 note: 15-16, 39-42, and Tables 11-12.

Note: plot 40 was produced with the LKTC method, whose code is currently not public and could not be retrieved from the authors. Currently (2011 data) the result is consistent with that obtained with Tag-and-Probe.

Step 6: Background estimation

QCD data driven background estimation

There are various methods employed to estimate the QCD background in a data-driven way (QCD is currently the only background not estimated from MC). The most important are the template fit method and the weight map method; carefully read chapter 6 of CMS-AN-11-013 for more details on the methods.

Reweighting method. First of all, read the quick description (in addition to what is written in the note) and also see this presentation.

cd ..
cvs co -d ControlPlots UserCode/Purdue/DYAnalysis/AnalysisMacros/QCDEstimation

There are a few steps in this method. First, create a pT-eta weight look-up table giving the probability for a muon to be isolated as a function of its pT and eta:

cd ControlPlots
root -l WeightMapFiller.C
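Conceptually, WeightMapFiller.C fills the isolation probability per (pT, eta) cell. A minimal Python sketch of the same idea, with hypothetical bin edges and toy muons:

```python
import bisect

# Hypothetical pT (GeV) and |eta| bin edges; the real binning is set in WeightMapFiller.C
PT_EDGES = [20, 30, 45, 60, 100]
ETA_EDGES = [0.0, 0.9, 1.2, 2.1, 2.4]

def fill_map(muons):
    """muons: list of (pt, abs_eta, is_isolated). Returns P(isolated) per (pT, eta) cell."""
    npt, neta = len(PT_EDGES) - 1, len(ETA_EDGES) - 1
    passed = [[0] * neta for _ in range(npt)]
    total = [[0] * neta for _ in range(npt)]
    for pt, eta, iso in muons:
        i = bisect.bisect_right(PT_EDGES, pt) - 1
        j = bisect.bisect_right(ETA_EDGES, eta) - 1
        if 0 <= i < npt and 0 <= j < neta:
            total[i][j] += 1
            passed[i][j] += iso
    return [[passed[i][j] / total[i][j] if total[i][j] else 0.0
             for j in range(neta)] for i in range(npt)]

# Toy input: three low-pT central muons (two isolated), one high-pT forward muon
prob = fill_map([(25, 0.5, 1), (25, 0.5, 0), (25, 0.5, 1), (70, 2.2, 1)])
```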

The next step is to view the map, and to test it on the sample of dimuons and single muons:

root -l testWeightMapDouble.C
root -l testWeightMapSingle.C

Other methods used for the QCD background estimation in the note are the SS/OS pair method and the template fit method (carefully read the description in the note!). The SS/OS method exploits the discriminating power of the isolation variable by considering classes of events with 2, 1 or 0 isolated muons. You can get the plot by running:

cvs co UserCode/ASvyatkovskiy/QCD_Background
cd UserCode/ASvyatkovskiy/QCD_Background
root -l bgSS_OS.C

As for the template fit method:

root -l
.L auto_fits.C; fitAll()
//It needs input parameters - the important ones are:
//bool etaBelow1_2 = false;
//bool fitFirstMu = true;
//bool singleMu = false;

Note: the original input files can be found at:


Checkpoint: these macros allow one to reproduce plots 45-48 as well as tables 13-15 from the note.

ABCD method

We estimate the QCD background with the ABCD method in order to improve the systematic uncertainty on the background estimation. The ABCD method is very simple:

1) Choose two variables that are assumed to be independent.

2) If there is no correlation, the ratios are equal: N_A / N_B = N_C / N_D.

3) In our study the two variables are the sign of the muon pair and the muon isolation.

4) The QCD fraction in each control region is not 100%, so we derive a correction factor for each of the regions B, C and D.

5) Measure N_B, N_C and N_D in the data sample, and at the end estimate N_A from them (applying the correction factors).
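The arithmetic of the steps above can be sketched as follows (yields and correction factors hypothetical):

```python
def abcd_estimate(n_b, n_c, n_d, f_b=1.0, f_c=1.0, f_d=1.0):
    """Estimate the QCD yield in signal region A from control regions B, C, D,
    assuming the two variables (pair sign, isolation) are uncorrelated:
    N_A / N_B = N_C / N_D  =>  N_A = N_B * N_C / N_D.
    f_b, f_c, f_d are the per-region QCD purity corrections (from QCDFrac.C)."""
    return (n_b * f_b) * (n_c * f_c) / (n_d * f_d)

# Hypothetical control-region yields and purities
n_a_qcd = abcd_estimate(1200, 800, 400, f_b=0.9, f_c=0.85, f_d=0.8)
```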

In UserCode/Purdue/DYAnalysis/AnalysisMacros/ABCDmethod

QCDFrac.C: to produce correction factors for each region

ABCD2vari.C: to produce the ABCD results. The correction factors from the QCDFrac.C are plugged in this macro as an input.

ttbar data driven background estimation

We employ the so-called e-mu data-driven background estimation method. See the following comprehensive talk for more details on the method. Currently the procedure consists of two steps:

1) produce the root files with histograms

2) run the macros on the root files produced

For both steps one needs to check out the following tags:

V00-05-00      MuonAnalysis/Examples                            
V01-13-00      MuonAnalysis/MuonAssociators                     
V01-01-11      RecoVertex/VertexTools                           
V00-05-00      SHarper/HEEPAnalyzer                             
V00-11-00      SUSYBSMAnalysis/Zprime2muAnalysis                
V00-03-00      UserCode/Examples 

The highlighted tags are important for step 2).

Following is the description of how to produce the root files.

The mother script file is Zprime2muAnalysis/test/DataMCSpectraComparison/histos.py

Instructions related to this script file are at

The short instruction is this:

python histos.py submit testing no_data

or when you are ready

 python histos.py submit

Wait for the root files to be done. Currently it is configured to produce histograms with the selection marked 'VBTF', matching what we have in DY2011.

Below I describe step 2) in detail. Check out the additional macros and copy them to your working directory:

cvs co UserCode/Purdue/DYAnalysis/AnalysisMacros/TTbarEstimation
cp UserCode/Purdue/DYAnalysis/AnalysisMacros/TTbarEstimation/* SUSYBSMAnalysis/Zprime2muAnalysis/test/DataMCSpectraComparison/
cd SUSYBSMAnalysis/Zprime2muAnalysis/test/DataMCSpectraComparison

Make sure the paths to datafiles inside the macros are pointing to the location of the root files you have produced. To produce the control plots for emu and mumu mass spectra use

 python emu_basichists.py

To produce the correction factors run:

 python emu_corrfactors.py

And finally, the MC expectation vs. data driven method prediction plots are produced with:

 python emu_prediction_plots.py

Good agreement between data and MC for both the mumu and emu spectra is necessary for the method to work reliably.
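Schematically, the e-mu method rests on lepton-flavour counting in ttbar decays: emu pairs are produced twice as often as mumu pairs, so the mumu ttbar yield can be predicted from the observed emu spectrum. A minimal sketch with hypothetical yields and a hypothetical efficiency correction factor (the real factors come from emu_corrfactors.py):

```python
def ttbar_mumu_prediction(n_emu, eff_ratio=1.0):
    """Predict the ttbar (flavour-symmetric) background in the mumu channel
    per mass bin: flavour symmetry gives N(emu) = 2 N(mumu) before efficiencies;
    eff_ratio is a hypothetical per-bin efficiency correction factor."""
    return [0.5 * n * eff_ratio for n in n_emu]

# Hypothetical observed emu yields in three mass bins
pred = ttbar_mumu_prediction([40, 120, 60], eff_ratio=0.95)
```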

Step 7: Unfolding

Unfolding is applied to correct for the migration of entries between bins caused by mass resolution effects (the FSR correction is applied as a separate step). For the Drell-Yan analysis, the chosen unfolding technique is matrix inversion: it provides a common interface between channels, which simplifies the combination and the systematic studies.
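On a toy two-bin spectrum, matrix-inversion unfolding reduces to inverting a 2x2 response matrix (all numbers hypothetical):

```python
def unfold_2x2(response, measured):
    """Unfold a 2-bin spectrum by direct inversion of the response matrix,
    where response[i][j] = P(reco bin i | true bin j), so measured = R * true."""
    (a, b), (c, d) = response
    det = a * d - b * c
    m0, m1 = measured
    return [(d * m0 - b * m1) / det, (a * m1 - c * m0) / det]

# Hypothetical response with 10% bin-to-bin migration; fold a known truth [100, 60]
R = [[0.9, 0.1],
     [0.1, 0.9]]
measured = [0.9 * 100 + 0.1 * 60, 0.1 * 100 + 0.9 * 60]
true = unfold_2x2(R, measured)
```

Inverting the response exactly recovers the true spectrum in this noise-free toy; with statistical fluctuations the inversion amplifies noise, which is why the pull studies below matter.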

To become familiar with how the unfolding/fore-folding object works, first run a demonstration script:

cvs co UserCode/kypreos/drellYan2010/unfolding
cd UserCode/kypreos/drellYan2010/unfolding/
source setup.sh
root -l test/testExpo.C++

To get back the pulls:

root -l test/testPulls.C++

The plots in the note are produced with the following macros:

 cvs co UserCode/Purdue/DYAnalysis/Unfolding

1. To produce the response matrix:

root -l unfoldingObs.C

2. To produce the unfolded yield plot do

root -l yield.C

Checkpoint: with these macros one should be able to reproduce plots 49-50 and Tables 17-18 from the note (note that Table 18 uses the background yield result from the background section).

Step 8: FSR correction

The effect of FSR is manifested by photon emission off a final-state muon. It changes the dimuon invariant mass, so the dimuon mass becomes distinct from the propagator (Z/gamma*) mass.

For our analysis we estimate the effect of FSR via a bin-by-bin correction in invariant mass bins, obtained by comparing the pre-FSR and post-FSR spectra. The pre-FSR spectrum is obtained by requiring the mother of the muon to be the Z/gamma*; for the post-FSR spectrum, no requirement is placed on the mother. The corresponding plots in the note are 52-55; they can all be calculated with the information available in the ntuple using:

root -l InvMassFSR.C++

To get the FSR histograms one needs to turn the calculateFSR flag on.
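The bin-by-bin correction itself is just the ratio of the pre-FSR to post-FSR yields per invariant mass bin; a minimal sketch with hypothetical yields:

```python
def fsr_corrections(pre_fsr, post_fsr):
    """Bin-by-bin FSR correction: ratio of pre-FSR to post-FSR yields per mass bin.
    Multiplying the measured (unfolded) post-FSR spectrum by these factors
    corrects it back to the pre-FSR level."""
    return [pre / post for pre, post in zip(pre_fsr, post_fsr)]

# Hypothetical yields: FSR migrates events out of the Z peak toward lower masses
corr = fsr_corrections(pre_fsr=[950.0, 4000.0], post_fsr=[1000.0, 3800.0])
```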

Checkpoint: this macro will allow one to get plots 52-55 from the note

Step 9: Systematic uncertainty estimation

There are various sources of systematic uncertainty affecting our analysis: the PDFs, the theoretical modeling, the efficiency estimation, the background estimation, the unfolding, etc.

For the background estimation with the data-driven method, we take as the systematic uncertainty the difference between the result obtained with the method and that expected from MC, per mass bin. The corresponding numbers are obtained with the emu_prediction_plots.py macro (see the recipe in the Step 6 section).
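This per-bin difference can be sketched as follows (hypothetical yields):

```python
def bkg_syst_per_bin(data_driven, mc_expected):
    """Systematic uncertainty per mass bin: absolute difference between the
    data-driven background estimate and the MC expectation."""
    return [abs(dd - mc) for dd, mc in zip(data_driven, mc_expected)]

# Hypothetical data-driven estimates vs. MC expectations in three mass bins
syst = bkg_syst_per_bin([19.0, 57.0, 28.5], [21.0, 55.0, 30.0])
```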

PDF uncertainty estimation. The recipe for the method currently used, step by step: reweight the PDFs of the existing MC samples as implemented in CMSSW. First, check out the necessary packages:

scramv1 p CMSSW CMSSW_4_2_3
cvs co -r CMSSW_4_2_3 ElectroWeakAnalysis/Utilities

then replace the LHAPDF library with the current up-to-date one, as described here:

touch $CMSSW_BASE/src/ElectroWeakAnalysis/Utilities/BuildFile.xml
scramv1 b
cd ElectroWeakAnalysis/Utilities/test

then change the input file in PdfSystematicsAnalyzer.py and run:

cmsRun PdfSystematicsAnalyzer.py

With the up-to-date LHAPDF, one can use CT10, MSTW2008*, CTEQ66, NNPDF2.0, and other PDF sets.

Efficiency estimation uncertainty. The current method for efficiency estimation in the DY analysis is the following: we estimate the MC-truth efficiency and then apply the efficiency correction map (pT-eta), extracted with the data-driven tag-and-probe method applied to data and MC, to weight the MC events. The systematic uncertainty associated with the tag-and-probe efficiency estimation comes from the line-shape modelling, the difference between fitting and counting, and the binning. The first two are calculated inside the macros described in Step 5. The binning systematic uncertainty is estimated using the following macro:


It takes as input the ROOT files containing the histogram of the efficiency correction as a function of invariant mass with two binnings (to estimate the binning uncertainty); the other sources of uncertainty are also accessed.
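If the individual sources are combined in quadrature per mass bin (a common convention; whether the note does exactly this should be checked there), the combination looks like:

```python
import math

def combine_in_quadrature(*sources):
    """Combine independent systematic uncertainties, each given per mass bin."""
    return [math.sqrt(sum(s[i] ** 2 for s in sources))
            for i in range(len(sources[0]))]

# Hypothetical relative uncertainties per bin: line shape, fit-vs-count, binning
total = combine_in_quadrature([0.010, 0.008], [0.005, 0.012], [0.003, 0.004])
```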

Step 10: Plotting the results

The main result of the measurement is the cross-section ratio, or the r (and R) shape. We distinguish the R and r shapes (see chapter 9 of the note for the definitions, and also Figure 64). Figure 64 shows the shape R for theory and measurement (for two independent trigger scenarios). It relies on the theoretical cross-section calculation (1-2 GeV binning), the final numbers for the acceptance correction, and the final numbers for the cross-section measurement. To clarify what this plot depends on, these are the tables used to produce the numbers in Figure 64:

Table 21-24: Theoretical predictions

Tables 25-26: Measurement

Table 5-10: Acceptance-efficiency corrections

To run the code one simply needs:

cvs co UserCode/Purdue/DYAnalysis/AnalysisMacros/GautierMacro
cp UserCode/Purdue/DYAnalysis/AnalysisMacros/GautierMacro/* $CONTROL_PLOTS_DIR
root -l theory_plot.C++

Use Gautier style macros to get the same plots with different style:

root -l DY.C
root -l plot.C

To get all the up to date values for the shape r/R use:

cvs co UserCode/Purdue/DYAnalysis/AnalysisMacros/ShapeR
./shapeDY.make

Among the style requirements for the presented results is to put each measurement point at the weighted position inside the bin (i.e. the location that makes the integrals over the sub-bins on either side equal). To calculate these positions, in root do:

.L compare_r.cc;
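As an illustration of the weighted-position rule (a sketch of the idea, not the actual compare_r.cc logic), for a steeply falling spectrum the point lands below the bin center:

```python
def weighted_position(f, lo, hi, steps=100000):
    """Position inside the bin [lo, hi] that splits the integral of f in half
    (the in-bin median), computed with a simple midpoint Riemann sum."""
    dx = (hi - lo) / steps
    total = sum(f(lo + (k + 0.5) * dx) for k in range(steps)) * dx
    acc = 0.0
    for k in range(steps):
        acc += f(lo + (k + 0.5) * dx) * dx
        if acc >= 0.5 * total:
            return lo + (k + 1) * dx
    return hi

# A hypothetical falling spectrum ~ m^-3 in the 60-120 GeV bin:
# the analytic answer is sqrt(5760) ~ 75.9 GeV, well below the bin center of 90
x = weighted_position(lambda m: m ** -3, 60.0, 120.0)
```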

Useful links

A lot of interesting information can be retrieved from the Zprime JTERM SHORT and LONG exercises (which are constructed along the same lines as this tutorial).