Be wary of stylized facts

Combinations of stylized facts from various papers can often produce curious conclusions.

In The Innovation Premium to Soft Skills, Philippe Aghion documents an interesting finding. The wage premium for low-skilled workers at an innovative firm is large, but the corresponding wage premium for high-skilled workers is non-existent.


In order to explain this finding, Aghion develops a model in which the complementarity between low-skilled and high-skilled workers is higher in firms that innovate more. Why might this be? Workers of all types have both “hard” and “soft” skills. Hard skills can be easily verified. (Did you in fact pass the California bar exam?) Soft skills take time to recognize. Think of the trustworthy butler at a wealthy person’s country home. It takes time for the home-owner to ascertain the trustworthiness of his butler. Once he realizes that he has a good butler, he may pay him much more and be loath to lose him.

The story makes perfect sense and the model rationalizes both the story and the empirical finding.

David Autor pointed out the following empirical fact: the urban wage premium has collapsed for workers in low-skilled occupations.

If we combine Autor’s and Aghion’s empirical facts together, we would naturally conclude that innovative firms do not concentrate in urban areas. If they did, then the wage premium for low-skilled workers in innovative firms would translate into an urban wage premium for the same workers. But this flies in the face of a third stylized fact: innovative firms tend to cluster in areas like San Francisco and Seattle.

Hence, our three stylized facts produce a contradiction.

There are a few ways out of this mess. The first is that Autor was looking at data in the United States and Aghion was looking at data in England. Maybe, then, there is no flattening of the urban wage premium for low-skilled workers in England. But Aghion’s model has nothing country-specific about it. If you take his model seriously, then it should also result in higher wages for low-skilled workers at innovative firms in the US, and if innovative firms cluster in urban areas, then there should be a positive urban wage premium in the US. So, if you believe all three stylized facts, you would have to conclude that Aghion has not identified the correct mechanism to explain his observation about wages in the UK.

Another option for reconciling these facts is that R+D spending may not in fact be a good proxy for how innovative a firm is. Aghion defines a firm as innovative if it engages in R+D, but John Haltiwanger likes to point out that Walmart, which spends nothing on R+D, is one of the world’s most innovative firms. This semantic solution is not satisfactory either, however. Tech firms disproportionately engage in R+D and are disproportionately located in urban areas, so if tech firms have a high degree of complementarity between low- and high-skilled workers, then we should see wages for low-skilled workers increasing in urban areas, which is of course not what we see.

The third possibility is that urban wages for low-skilled workers would actually be even lower in the absence of urban R+D-intensive firms. In the absence of these firms, the wage gradient for low-skilled workers wouldn’t just be flat: it would slope downward as population density increases and we move from rural to urban areas. Then we would have to ask: what is pushing urban wages lower than rural wages for the same work? Have the new minimum wage laws had no effect? Or are they not showing up yet in the data?

I don’t have a good answer for how to reconcile these three stylized facts, but the exercise shows the peril of combining conclusions that are largely true when you slice the data one way with conclusions that are largely true when you slice it another way. It reminds me of the many examples in probability theory that violate transitivity. Consider the following puzzle:

Is it possible to have random variables X, Y, and Z for which, simultaneously, Prob(X>Y) > ½, Prob(Y>Z) > ½, and Prob(Z>X) > ½?

The answer is, surprisingly, yes. In fact, it’s even possible to have Prob(X>Y) = .6, Prob(Y>Z) = .6, and Prob(Z>X) = .6. This means that three stylized facts can each be true on average even if they violate transitivity (and seem contradictory). But it also means that the stylized facts are not capturing a lot of other important information that would paint a more detailed view of the phenomenon you are trying to explain.
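A quick brute-force check makes the puzzle concrete. The Efron-style dice below are one classic construction in which each variable beats the next with probability 5/9 > ½ (hitting exactly .6 for all three pairs requires a different construction):

```python
from itertools import product
from fractions import Fraction

# Efron-style intransitive dice: X beats Y, Y beats Z, and Z beats X,
# each with probability 5/9.
X = [2, 2, 4, 4, 9, 9]
Y = [1, 1, 6, 6, 8, 8]
Z = [3, 3, 5, 5, 7, 7]

def prob_beats(a, b):
    """P(a > b) when one face of each die is drawn uniformly at random."""
    wins = sum(1 for fa, fb in product(a, b) if fa > fb)
    return Fraction(wins, len(a) * len(b))

print(prob_beats(X, Y), prob_beats(Y, Z), prob_beats(Z, X))  # 5/9 5/9 5/9
```

Each pairwise probability exceeds ½, yet no die is "best": the beats relation forms a cycle.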

Learning on the job (aka AEA conference part II)

In Production and Learning in Teams, Herkenhoff, Lise, Menzio and Phillips build a very sophisticated labor-search model with on-the-job learning. While the model is motivated by a single empirical fact, it is extremely flexible, allowing for many potential causal mechanisms and capturing a panoply of equilibrium outcomes in the labor market.


The motivating empirical fact is the following: When workers go through a spell of unemployment before finding a new job, a 10% increase in the coworkers’ average wage in the first job forecasts a 1.5% higher wage in the second job only for workers who are paid less than their coworkers in the first job. There is no such effect for workers who are paid more than their coworkers in the first job.

What could be happening here? Due to search frictions, a worker’s wage often lags behind her true human capital. Workers may be able to leverage an offer by an outside firm to negotiate a wage increase, but this does not happen every day and so wages are “sticky.” Therefore, on the day you leave your first job, you are likely to have a higher level of human capital than your wage indicates. The empirical fact suggests that human capital might be higher for those whose first job involved interacting with more skilled coworkers. It also suggests that human capital is not depreciated by colleagues with lower skill. In essence, workers learn from more skilled colleagues, but they do not “unlearn” from those with less skill.

An employment-to-employment transition suffers from the possibility that workers are leveraging connections at their current job in order to vault into a better paying position in the next job. That’s why the authors restrict their attention to employment-unemployment-employment transitions. They want to write a model where the starting wage reflects underlying human capital, rather than political or business connections, so they must restrict their attention to new jobs that are preceded directly by a spell of unemployment.


The model features a continuum of firms and a continuum of workers. The workers are indexed by levels of human capital, 1-7, with 7 being the highest. Firms can employ one or two workers. The production function is designed so that the output of a firm with two workers always exceeds the combined outputs of two one-person firms employing workers with the same level of human capital. This is logical: otherwise firms would never employ multiple workers. Depending on the parameterization, the production function can either be supermodular or submodular. Supermodularity means that, for firms employing workers of human capital x and y, f(x,x) + f(y,y) > 2*f(x,y) if x ≠ y.
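As a toy illustration of that inequality (the functional forms here are invented, not the paper’s calibrated production function), a multiplicative technology is supermodular while a technology concave in total human capital is submodular:

```python
# Check the supermodularity condition f(x,x) + f(y,y) > 2*f(x,y) for x != y.
# Both functional forms below are illustrative stand-ins.

def f_super(x, y):
    # Multiplicative technology: positive cross-partial, hence supermodular
    # (the inequality is just (x - y)**2 > 0).
    return x * y

def f_sub(x, y):
    # Output concave in combined human capital: submodular.
    return (x + y) ** 0.5

def satisfies_inequality(f, x, y):
    return f(x, x) + f(y, y) > 2 * f(x, y)

pairs = [(1, 4), (2, 7), (3, 5)]
print(all(satisfies_inequality(f_super, x, y) for x, y in pairs))  # True
print(any(satisfies_inequality(f_sub, x, y) for x, y in pairs))    # False
```

Under the supermodular technology, firms prefer to pair similar workers; under the submodular one, mixing skill levels is more productive.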

There are four value functions: one for an unemployed worker and three for workers in various states of employment. The value functions take into account the current value of a worker’s production as well as the probabilities of transitioning to different states in the next period. As in most search models, there is no curvature in the utility function, so consumption = output = utility. As in many search models, a key variable is the “surplus,” which captures the combined value of the employment relationship to the worker, the coworker, and the firm. Wages are simply a constant fraction of this surplus.

The model is able to capture a number of important features of the labor market:

·        Loss of skills: human capital stays the same or depreciates with some exogenous probability during a spell of unemployment.

·        Learning by doing: human capital stays the same or appreciates with some exogenous probability even in the absence of having a coworker.

·        Learning from coworkers: the probability of increasing your human capital depends positively on the difference between your coworker’s human capital and your own. The model is flexible enough to allow for non-linear rates of learning. You may learn faster in the presence of a more skilled co-worker than you would “unlearn” in the presence of a less-skilled coworker.

·        Divergent values for leisure: the value of home production (what you can do with your free time) depends on the level of human capital.            

·        Poaching of workers and negotiated salary increases.
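The learning-from-coworkers channel above can be sketched with a toy simulation. The transition probabilities below are invented for illustration and are not the paper’s calibrated values; the only features carried over from the model are the 1-7 human capital grid, learning by doing, and the asymmetry that workers learn from more skilled coworkers but never “unlearn”:

```python
import random

random.seed(0)

ALPHA = 0.05  # baseline learning-by-doing probability (illustrative, not calibrated)
BETA = 0.03   # extra learning probability per unit of coworker skill gap (illustrative)
TOP = 7       # human capital types run from 1 to 7, as in the model

def step(h, coworker):
    """One period: human capital may rise while employed, but never falls."""
    gap = max(coworker - h, 0)             # only a more skilled coworker helps
    p_up = min(ALPHA + BETA * gap, 1.0)
    return h + 1 if h < TOP and random.random() < p_up else h

def simulate(h0, coworker, periods=50):
    h = h0
    for _ in range(periods):
        h = step(h, coworker)
    return h

runs = 2000
avg_with_mentor = sum(simulate(2, 7) for _ in range(runs)) / runs
avg_with_peer = sum(simulate(2, 2) for _ in range(runs)) / runs
print(avg_with_mentor, avg_with_peer)
```

On average, the type-2 worker paired with a type-7 coworker ends up substantially more skilled than one paired with another type-2 worker, which is the mechanism the authors use to rationalize the coworker-wage fact.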

Features of the equilibrium:

·        Firms slowly upgrade their workforce, either by letting on-the-job learning take place or by replacing current employees with new ones.

·        Firms lose employees for a variety of reasons: poaching by other firms, optimally choosing to replace a current employee with a new one, and exogenous transitions by the worker into unemployment or exiting the labor force altogether.

·        Due to search frictions the human capital of a worker may increase without the wage immediately reflecting that. The wage of a worker only reflects his human capital at the time of an initial hire or the last time he received a sufficiently attractive offer from a poaching firm.

·        After a spell of unemployment, the wage reflects the underlying human capital.

·        The production function is supermodular: firms like to match workers with similar levels of human capital. This results in positive assortative matching, which is inefficient relative to the social optimum. With positive assortative matching, there is too little opportunity for workers to learn from each other, so aggregate human capital accumulation in the economy is too low, and total output is too low.


The process by which the authors calibrate the model parameters is discussed in great detail. Some of the calibrations, such as targeting an average of 35 years in the labor force, are standard in the search-theoretic literature. Interesting moments they target in order to parameterize the learning process include: the relation of between-firm to within-firm wage variance (which reflects equilibrium sorting of workers in firms according to human capital), the rate at which workers take a new job if they are employed with more-skilled coworkers vs less-skilled coworkers (which reflects the potential for on-the-job learning from coworkers depending on their skill level), and life cycle wage growth (which reflects the underlying human capital accumulation).


A key question to ask of any structural model is the extent to which its predictions are “baked in” by its assumptions. The model developed by Herkenhoff, Lise, Menzio and Phillips allows for a great deal of flexibility. Learning from colleagues can be non-linear, or absent entirely. The model allows for the presence or absence of learning-by-doing. While a parametric form for the production function must be assumed, the production process can be either super- or sub-modular. The production function is explicitly designed for the “knowledge economy” in that it depends only on human capital.

It would be interesting to see if a production function that included capital and allowed for some complementarities between capital and workers of various skills would generate the same conclusions. The assumption that firms employ no more than two workers does not seem overly restrictive. Most models that feature learning in teams still have all team members learning from only the most skilled person on the team (see Akcigit: Dancing with the Stars).

Some of the calibrated parameters don’t look that great to me. A discount rate of 15% a year is quite high; many labor-search models suffer from this because of the lack of curvature in the utility function. Also, some of the moments the authors target do not seem to fit well. See below and judge for yourself.


The model does, however, make a couple of curious predictions, and I suspect that they may be generated by an assumption about the number of human capital types. The model predicts that firms currently employing two very high-skilled workers will fire the more highly-skilled of the two and replace him with a medium-skilled worker. The model also predicts that if a firm encounters a highly skilled unemployed worker, it will fire its highest-skilled employee in order to hire the unemployed worker. The authors provide the following explanation: “The firm finds this optimal because the worker can teach more to the worst employee than to the best employee.” While this is true, I wonder if the results would still hold if there were no upper bound on human capital.

In the model, a firm employing two workers with human capital equal to 7 will not experience any growth. These workers will neither be able to learn from each other nor learn by doing, because the model constrains human capital to top out at 7. If the model were changed to allow for 100 values of human capital rather than just 7, a firm currently employing two type-7 workers would have two additional options that are not available to it in the existing model: it could hire a worker with human capital greater than 7 if it encounters one, or it could let its current employees “learn by doing.” The firm therefore might not want to fire the type-7 worker in order to hire a type-4 worker, as the current model predicts. In real life, the potential for human capital is unbounded, so you would not want the boundedness assumption to be driving some of the model’s predictions.

Nevertheless, the paper is extremely impressive. The model is able to generate lots of interesting life-cycle patterns in wages as well as cross-sectional distributions in skills and assortative matching such as the one below:


There are a lot of fascinating potential extensions of the model, some of which the authors allude to in their conclusion. One additional question they do not mention: how would equilibrium outcomes differ if workers had heterogeneous bargaining power in the wage determination process? In the United States, some workers can negotiate a higher salary based on their unique and hard-to-replace skills. I suspect that allowing for heterogeneous bargaining power among workers would reduce the value to a firm of employing a high-skilled worker and would therefore result in more negative assortative matching relative to the current model.

The causal effects of neighborhood quality (aka Heckman's favorite paper of 2018)

There is extensive sociological research by William Julius Wilson and others that strives to show that neighborhood quality matters for individual outcomes. The Moving to Opportunity (MTO) voucher program, started in 1994, was designed to test this hypothesis by randomly giving financial incentives to families living in high-poverty neighborhoods to relocate to neighborhoods with less poverty. There is a robust literature on MTO showing that so-called “voucher effects” - the impact on income/unemployment/welfare for those awarded an MTO voucher - are not significant.

So who is right? Do the sociologists have a nice theory that is not borne out by the data? Have the economists failed to capture the true effect of moving to a new neighborhood? Or is the housing voucher program not designed well enough to target those people who might actually benefit from it?

The Problem:

There are three groups in the MTO program: control, section 8, and experimental. Those in the experimental group receive a rent-subsidizing voucher that can be used only to move from a high-poverty neighborhood to a low-poverty neighborhood. Those receiving a section 8 voucher can use it to move to either a low-poverty or a middle-poverty neighborhood. The fundamental problem in evaluating the causal effect of moving from a high-poverty neighborhood to a middle- or low-poverty neighborhood is noncompliance.

Nearly 50% of families receiving a voucher did not use it to relocate. In addition, 21% of control families relocated to lower-poverty neighborhoods despite not receiving a voucher. Across a broad range of covariates, those who use the voucher to relocate differ significantly from those who do not. This means that, while assignment to any of the three groups is random, there is a selection effect within each group. Due to this selection problem, prior research on MTO has focused only on the causal effect of receiving a voucher (an intent-to-treat analysis). It should not be surprising, however, that this literature has found insignificant voucher effects: any estimate of the voucher effect includes the null effects of the roughly 50% who receive vouchers but do not relocate.
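A quick simulation illustrates the dilution mechanically. Unlike real MTO data, compliance here is drawn independently of the potential gains, and the size of the true moving effect is invented; the point is only that the intent-to-treat estimate shrinks the true effect by the compliance differential, while the standard instrumental-variables (Wald) ratio undoes the shrinkage:

```python
import random

random.seed(1)

N = 100_000
TAU = 1.0        # true income gain from relocating (invented, arbitrary units)
P_COMPLY = 0.50  # share of voucher recipients who actually move (as quoted above)
P_CROSS = 0.21   # share of control families who relocate on their own (as quoted above)

treat_income, control_income = [], []
treat_moved, control_moved = [], []
for _ in range(N):
    voucher = random.random() < 0.5                    # random assignment
    moved = random.random() < (P_COMPLY if voucher else P_CROSS)
    income = random.gauss(0.0, 1.0) + (TAU if moved else 0.0)
    (treat_income if voucher else control_income).append(income)
    (treat_moved if voucher else control_moved).append(moved)

mean = lambda xs: sum(xs) / len(xs)
itt = mean(treat_income) - mean(control_income)        # ~ TAU * (0.50 - 0.21) = 0.29
wald = itt / (mean(treat_moved) - mean(control_moved)) # ~ TAU
print(round(itt, 2), round(wald, 2))
```

The intent-to-treat estimate is less than a third of the true effect purely because of noncompliance, even with no selection at all.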

The Solution:

In Noncompliance as rational choice (2018), Rodrigo Pinto uses a brilliant identification strategy to show how noncompliance, which heretofore has been seen as a thorn in the side of researchers estimating causal treatment effects, can be used to nonparametrically identify the effect of treatment on the treated. Here’s how the identification strategy works.

Each household faces a multinomial choice problem. Whichever treatment group it is placed in (control, section 8, or experimental), the household can choose to stay in a high-poverty neighborhood, move to a middle-poverty neighborhood, or move to a low-poverty neighborhood. Because noncompliance is so widespread, the estimation problem can be recast as one of identifying the distinct response types in the data.

Each type has a unique profile of counterfactual choices it would make depending on the treatment received. One type, for example, would choose to stay in the high-poverty neighborhood if placed in the control group, move to a medium-poverty neighborhood if placed in the section 8 group, and move to a low-poverty neighborhood if placed in the experimental group. Another type would stay in the high-poverty neighborhood regardless of the treatment group. This is an extension to a 3 × 3 setting of Angrist, Imbens, and Rubin’s (1996) division of the data into Always-Takers, Never-Takers, Compliers, and Defiers. Here, there are 3^3 = 27 unique types.

The treatment effect on these 27 types is not identified because we do not observe the counterfactual outcomes. Only 9 total choices are observed in the data, three for each treatment group. One of the key innovations in this paper is to use a revealed preference approach to reduce the number of possible types from 27 to just 7 economically justifiable response types. How does the revealed preference approach work?

Consider an agent who is in the control group. If she chooses to move to a medium-poverty neighborhood, then revealed preference rules out any type where she gets a section 8 housing voucher and stays in a high-poverty neighborhood. This is because the section 8 voucher would augment her budget set if she moved to the middle-poverty neighborhood relative to when she is in the control group, and if she has preferences that are strictly monotonic in at least one good (an innocuous assumption), she would strictly prefer this augmented budget set. Therefore, three types are eliminated: (medium, high, high), (medium, high, low), and (medium, high, medium).*
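The counting and the single elimination rule just described can be checked mechanically. This sketch implements only that one rule, not the paper’s full reduction to seven types, and uses the same ordered-triple notation as the footnote below:

```python
from itertools import product

# A response type is a triple of counterfactual choices:
# (choice if control, choice if section 8, choice if experimental).
NEIGHBORHOODS = ("high", "medium", "low")
all_types = list(product(NEIGHBORHOODS, repeat=3))  # 3**3 = 27 candidate types

# Revealed-preference rule from the text: a section 8 voucher expands the budget
# set of a family that chose "medium" in the control group, so no rational type
# pairs a control choice of "medium" with a section 8 choice of "high".
eliminated = {t for t in all_types if t[0] == "medium" and t[1] == "high"}
print(len(all_types), sorted(eliminated))
```

This one rule removes exactly the three types listed above: (medium, high, high), (medium, high, low), and (medium, high, medium). Layering on the paper’s remaining monotonicity arguments eliminates 17 more.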

We cannot, however, assume that she will move to a low-poverty neighborhood if she receives the experimental voucher. Remember that the experimental voucher subsidizes movement only to low-poverty neighborhoods. Since we only observe her choice when she is in the control group, we do not know which of the following relations capture her preferences: medium > high > low or low > medium > high.

The revealed preference approach reduces the 27 potential types to the following seven types, which are presented in the table below, sourced from the original paper:


The author shows that the elimination of 20 types embeds 9 monotonicity conditions whose validity he then tests using propensity scores. The estimated propensity scores satisfy each one of the 9 monotonicity conditions. As there are 336 total feasible combinations for the propensity score inequalities, the data strongly supports the use of the revealed preference approach.

Pinto is able to estimate the proportion of each type in the data and, using some nifty matrix algebra, counterfactual outcomes for all seven economically justifiable types. He finds that two types are most common in the data, s1 and s4. He is also able to predict an individual’s type based on her covariates. For example, someone who has no teenagers, has moved in the past to seek schools, and indicates being a prospective mover is more likely to be type s4 than the average recipient of the vouchers. Voucher recipients who do not relocate are more likely to have disabled household members and to have lived in their current neighborhood a long time. This suggests that we could rewrite the eligibility criteria for the MTO program to target those people who are most likely to make use of the voucher if they receive it.

At the beginning of this post, I asked whether the sociologists or the economists were right in their assessment of neighborhood effects. Pinto finds that moving from a high-poverty neighborhood to a low-poverty neighborhood yields a 14% increase in income, a 20% decrease in the likelihood of being unemployed, and a 38% increase in the chance of breaking out of poverty. All of these effects are statistically significant. This paper provides convincing evidence in support of the sociological view that neighborhood characteristics are important in determining individual outcomes and suggests that the prior emphasis on an intent-to-treat analysis does not do a good job of capturing these neighborhood effects.

The paper is available here.

*The first element of the ordered triple is the choice of the agent if placed in the control group, the second is the choice if placed in the section 8 group, and the third is the choice if placed in the experimental group.

Reflections on David Autor's presentation at the AEA annual meeting

At this year’s AEA annual meeting in Atlanta, David Autor gave a lecture on the changing geography of work. His main contribution was to disaggregate employment by skill level or education and show interesting trends in the location and wages of jobs. Here is one such figure:


The four graphs show a relatively consistent urban wage premium over time for workers with a college education, but a collapse in the urban wage premium for workers with no college. The near-constancy of wages in 2015 across all population densities for workers with low education actually hides some heterogeneity: men and women with only a high school education experience a small urban wage premium if they have mid-skill jobs, while those who have low-skill jobs experience no urban wage premium. What are mid-skill jobs? They are primarily production, clerical, administrative, and sales jobs, and they have been vanishing from the economy.


Much ink has been spilt over declining geographic mobility in the US. Some observers place blame on the rise of occupational licensing. A hairdresser in rural Tennessee won’t just move to San Francisco to cut hair if becoming a hairdresser in California requires a significant investment in time and money to obtain a license. Others (see Hsieh and Moretti, 2015) blame housing regulations, which push up rents in productive cities, making them too expensive for potential newcomers. If only the rural Tennessean hairdresser could afford to move to SF, she might experience an increase in pay, living standards and opportunity.

The data presented by David Autor suggest that moving to San Francisco will do very little to help the Tennessean hairdresser. Low-skilled workers stand to gain very little, if anything, by moving from a rural area to an urban area, even if the cost of living were the same in both areas. Only highly-skilled workers capture wage gains by moving to cities. Furthermore, the data suggest that, while there might be significant spillovers among high-skilled workers who live in the same urban area, there are few, if any, spillovers from high-skilled to low-skilled workers within an urban area. The tech workers in San Francisco, the financiers in New York, and the consultants in Washington are not pulling up the wages of the low-skilled workers in their cities.

Furthermore, if housing regulations are relaxed and rents decrease somewhat, we should expect the first people to move into cities to be the highly-skilled, because they stand to gain the most from the move. This will likely not reduce inequality in urban areas. If rents in San Francisco could be made as cheap as rural America’s (and this seems impossible), it might become rational for low-skilled workers to move to the city. Otherwise, San Francisco will continue to be populated largely by the rich.

Autor’s presentation can be accessed here.

About that supposed weak link between test-taking and research quality...

One often hears that exams have little correlation with, well, anything. It’s a matter of faith among grad students that our very difficult prelims at the end of the first year reveal little about our potential as researchers. Good researchers are different from good test-takers, or so we think.

It turns out that an interesting paper addresses this very question. In What Does Performance in Graduate School Predict? Graduate Economics Education and Student Outcomes, a group of top economists regress first-year exam scores on admissions rank, and then job placement on first-year scores. The rank of the economics department where a newly-minted PhD is first hired is used as a proxy for research quality.

Before going into the findings, some important caveats are necessary.

The sample consists of only five schools: Harvard, MIT, the University of Chicago, Princeton and Stanford.

The data is for students who entered these five schools in the years between 1990 and 1999.

The R-squared of the regressions is particularly low, no higher than .12 for any specification. This means that regressors with significant coefficients are still not explaining very much of the variation in job placement.

It’s worth pointing out that some stars who are hired at top-ranked PhD departments upon graduation don’t get tenure, while others take many years to develop a reputation as a top-quality researcher. Job placement is therefore an imperfect proxy for research quality. The advantage of examining job placement, though, is that it is much easier to assess than the quality of a researcher’s output.

And now for some interesting findings…

First-year test scores are highly correlated across subjects.

Students from foreign undergraduate institutions perform significantly better on first-year exams than do students from American schools.

Women at these five schools did not perform worse in the job market than men. Conditional on first-year test scores, women were placed at slightly higher-ranked institutions upon graduating.

First-year test scores are a significant predictor of job placement. In fact, first-year scores in micro and macro are just about the only significant predictors of job placement in the sample. Having attended a top-ranked undergraduate American university was the other significant factor affecting job placement in the top twenty economics departments. This raises important questions. Are people who attend elite undergraduate schools in America benefitting from a form of nepotism in the job market, whereby they are hired at the institution where they went to school? Or is there a systematic difference between these students and those who don’t attend elite undergraduate colleges that is not captured by the other variables in the model?


One finding is hard to explain. Admissions rank helps predict first-year exam scores and exam scores help predict job placement, but admissions rank does not help predict job placement. In fact, admissions rank does not predict job placement even if it is the only variable in the model.

The regression results presented here go against the stories we tell ourselves in graduate school. There are three possible interpretations. The truth most likely is a combination of the three.

1.      The cynical view: exams serve merely as a signal of underlying characteristics such as intelligence and motivation. The fact that good test-takers end up becoming good researchers should not surprise us, because these same qualities are important in determining the quality of a researcher’s output.

2.      The Panglossian view: courses in economics graduate programs are carefully designed to teach material that is imperative for economic research. The exams are comprehensive and therefore serve as effective arbiters of a given student’s knowledge of the material. Students who do well on these exams have mastered economic theory, and this superior knowledge explains why they become good researchers.

3.      While both of the above explanations may contain a modicum of truth, with such a low R-squared, these regressions still aren’t explaining very much. As the authors conclude:

“The difficulty predicting job placement may, in part, result from the noise in our data, ambiguity in ranking jobs, the incompleteness of our measures, and inherent randomness in the academic job market. Diligence, perseverance, and creativity—factors that surely matter for successful research careers and job placement—are difficult to define and measure. Our results suggest that there is not an easily recognizable star profile or single path to success for an economics graduate student.”

Paper available here:

Athey, Susan, Lawrence F. Katz, Alan B. Krueger, Steven Levitt, and James Poterba. “What Does Performance in Graduate School Predict? Graduate Economics Education and Student Outcomes.” American Economic Review 97, no. 2 (2007): 512–518.

Mathematical resources for economists

In this post, I review a number of the resources I’ve used to learn the mathematics required by PhD courses in economics. Some of these texts were required reading for UPenn’s math camp, while I have used others for self-study in the past. This reference list is incomplete and will be updated as I encounter new texts.

Real Analysis/Topology

UPenn’s summer math camp used the first five chapters of Real Mathematical Analysis, by Charles Chapman Pugh, to teach real analysis. I believe Penn uses this book because the material that shows up frequently in economics is compiled concisely in the book’s first five chapters and because the problems in Pugh are *very* difficult. However, I was not particularly fond of this book, as it provides few examples to clarify the concepts it introduces. Also, there is little to no commentary explaining where its proofs come from and why they work. For the topological concepts that were not explained well in Pugh, the first three chapters of Topology by Munkres were a useful reference. Munkres is a great text. There are plenty of diagrams and examples to aid in understanding the material. The only problem from an economist’s point of view is that it does not cover real analysis. For that I recommend Sohrab’s Basic Real Analysis. Unlike the more famous Principles of Mathematical Analysis by Rudin (which offers mostly theorems and proofs but little discussion), Sohrab provides examples that build intuition and commentary to clarify the proofs. Sohrab explains real analysis more clearly and more thoroughly than other books I have seen.

In short, if you already know real analysis and topology and just want to refresh your memory and do some really hard problems, check out Pugh. If you want to understand topology, the first three chapters of Munkres are enough for most economists. If you are trying to teach yourself real analysis, read Sohrab.


Linear Algebra

There is no shortage of good books on linear algebra. Simon and Blume’s Mathematics for Economists has several chapters on linear algebra which provide a basic, but not quite sufficient, overview of everything you will use in your first year as a PhD student. Note that if you are using Simon and Blume, you also need to study the “Advanced Linear Algebra” topics at the back of the book (never fear, they aren’t particularly advanced). In particular, we used the Rank-Nullity Theorem a lot in math camp at Penn, and that isn’t covered until the end of Simon and Blume.
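For reference, the theorem itself is short enough to state here (standard notation, not Simon and Blume’s):

```latex
% Rank-Nullity Theorem: for a linear map T : \mathbb{R}^n \to \mathbb{R}^m
% represented by an m \times n matrix A,
\operatorname{rank}(A) + \operatorname{nullity}(A) = n,
% where rank(A) = \dim(\operatorname{im} A) is the dimension of the column
% space and nullity(A) = \dim(\ker A) is the dimension of the null space.
```

In words: the dimensions of the image and the kernel of a linear map always add up to the dimension of the domain.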

If you have more time, Elementary Linear Algebra, by Edwards and Penney, is an easy-to-understand book that covers more material than Simon and Blume. In particular, you get more practice with basis vectors, projection matrices, transformations, and eigenvalues. I would recommend this book to anyone who has not studied linear algebra before.  It is approachable enough that a diligent student could learn linear algebra from it without the aid of a teacher. Note that this book is out of print, but it can be found at many university libraries and through used book sellers on Amazon.


Univariate and Multivariate Calculus

Simon and Blume is more than sufficient for refreshing your memory about the computational side of calculus (taking total and partial derivatives, using the implicit and inverse function theorems, etc). Penn’s math camp, however, delved into a more technical examination of multivariate analysis. See the fifth chapter of Pugh (above) for this material.
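As a quick reminder of the computational side (this is the standard one-equation statement, not tied to any particular text): if an equation F(x, y) = 0 implicitly defines y as a function of x, the implicit function theorem lets you differentiate without solving for y.

```latex
% Implicit function theorem, one-equation case:
% if F(x, y(x)) = 0 and F_y \neq 0 at the point of interest, then
\frac{dy}{dx} = -\frac{F_x(x, y)}{F_y(x, y)}.
```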



I *highly* recommend A First Course in Optimization Theory by Rangarajan K. Sundaram. We followed chapters 2-9 in Penn’s math camp. Sundaram does an excellent job of explaining the Lagrange and Kuhn-Tucker methods, why they often work, and when they fail. He also explains concavity and quasiconcavity well, and he shows how determining that an objective function is concave or quasiconcave can allow you to relax some of the assumptions of the Kuhn-Tucker and Lagrange Theorems. The author works through many examples (often with unusual functional forms) to show why the Lagrange and Kuhn-Tucker methods usually yield solutions and why they sometimes don’t.

Simon and Blume also has several chapters devoted to optimization. The book covers a few things that Sundaram does not: bordered Hessians, envelope theorems, and the intuition behind the sign of the Lagrange multiplier in equality- and inequality-constrained problems. The advantage of Simon and Blume is that the text is easy to read and the problems are not too hard.  Before I began studying economics at LSE, it was possible for me to teach myself the Lagrange and Kuhn-Tucker methods by reading Simon and Blume and working through the problems. For anyone who has lots of time, I would recommend studying chapters 16-19 of Simon and Blume and then reading chapters 2-9 of Sundaram to fill in any gaps in your understanding of optimization.
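For concreteness, here is the general shape of the Kuhn-Tucker first-order conditions; this is the textbook-standard statement, not a substitute for the careful treatment in Sundaram. For maximizing f(x) subject to inequality constraints g_i(x) ≤ 0:

```latex
% Kuhn-Tucker conditions for \max f(x) \text{ s.t. } g_i(x) \le 0,\ i = 1, \dots, k:
\nabla f(x^*) = \sum_{i=1}^{k} \lambda_i \nabla g_i(x^*), \qquad
\lambda_i \ge 0, \qquad
\lambda_i \, g_i(x^*) = 0 \quad \text{(complementary slackness)}.
```

Complementary slackness is the key intuition: each multiplier λ_i can be positive only if its constraint binds (g_i(x*) = 0), and a slack constraint must have a zero multiplier.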

Also useful is a pdf explaining when to use Kuhn-Tucker and the intuition behind the complementary slackness conditions.

Probability and Statistics

Penn’s math camp used the first five chapters of Statistical Inference by Casella and Berger to teach probability and statistics. The book is good because the authors work through several examples to explain each concept, and the examples use a variety of distributions (not just the normal distribution!) to help build the reader’s statistical fluency. There are also a ton of exercises to work through on your own, and the solutions manual is available online. However, the book makes for dense reading, and it is probably not the best resource for someone who last studied statistics several years ago.

If this is you, I recommend you read the review chapter of Christopher Dougherty’s Introduction to Econometrics. Among other things, this review chapter covers probability distributions, hypothesis testing, unbiasedness and consistency, and central limit theorems. The slides, which are available online, are a very useful accompaniment to the text and provide a good visual aid for things like how a central limit theorem works.


Macro math

Good books on differential equations are a dime a dozen. There are three chapters on them in Sydsæter and Hammond’s Further Mathematics for Economic Analysis that are probably sufficient preparation for study at the PhD level. Dynamic programming and optimal control are hard, and only about half of the students entering the PhD at Penn seem to have encountered them before. Upper-year students at Penn swear by Stokey, Lucas, and Prescott’s Recursive Methods in Economic Dynamics, but I have yet to read it. The paper by Robert Dorfman cited below provides a good explanation of optimal control, but most students will probably need to work through a few problems in addition to reading the paper in order to understand how it works.


Lastly, for anyone who has already studied everything listed above and who just wants a resource filled with formulas and theorems useful to economists, there is the Economists’ Mathematical Manual by Sydsæter, Strøm, and Berck.



Simon, Carl P., and Lawrence Blume. Mathematics for Economists. New York: W. W. Norton, 1994.

Casella, George, and Roger L. Berger. Statistical inference. Duxbury/Thomson Learning, 2001.

Dorfman, Robert. "An economic interpretation of optimal control theory." The American Economic Review 59, no. 5 (1969): 817-831.

Dougherty, Christopher. Introduction to econometrics. Oxford University Press, 2016.

Edwards, C. H., and D. E. Penney. Elementary Linear Algebra. Englewood Cliffs, NJ: Prentice-Hall, 1988.

Munkres, James R. Topology. Prentice Hall, 2000.

Pugh, Charles Chapman. Real Mathematical Analysis. New York: Springer, 2002.

Rudin, Walter. Principles of Mathematical Analysis. 3rd ed. New York: McGraw-Hill, 1976.

Stokey, Nancy L., and Robert E. Lucas Jr., with Edward C. Prescott. Recursive Methods in Economic Dynamics. Cambridge, MA: Harvard University Press, 1989.

Sohrab, Houshang H. Basic real analysis. Vol. 231. Birkhäuser, 2003.

Sundaram, Rangarajan K. A first course in optimization theory. Cambridge university press, 1996.

Sydsæter, Knut, Arne Strøm, and Peter Berck. Economists' mathematical manual. Vol. 3. New York, NY: Springer, 2005.

Sydsæter, Knut, Peter Hammond, and Atle Seierstad. Further mathematics for economic analysis. Pearson education, 2008.

A typical day during math camp

Now that five weeks of UPenn’s six-week math camp are over, I’ve managed to settle into a sort of weekly rhythm. For those of you who are looking into doing PhDs or are gearing up for your own respective math camps, I thought I would share my experience so you have an idea of what to expect.

Most weeks feature four hours of class every day between Monday and Thursday, followed by a two-hour quiz on Friday. My typical daily schedule is as follows.

Monday – Thursday:

8:15 – 10:00 am: Study

10:00 am – 12:00 noon: Class

12:00 – 12:30 pm: Lunch

12:30 – 1:30 pm: Study

1:30 – 3:30 pm: Class

3:30 – 4:30ish pm: BREAK. By this point in the afternoon my brain is usually pretty fried, so I can’t start studying immediately. I usually just take a walk during this time. Occasionally I indulge in a donut…

4:30 – 7:00 pm: Study

7:00 – 11:00 pm: Some combination of cooking, eating dinner, studying, and sometimes going to the gym.


After the quiz on Friday, I take the afternoon off to do something fun. Usually I play basketball. The weekend is not usually very intense, either: I probably study a combined 10-12 hours between Saturday and Sunday. While math camp covers a lot of material, I haven’t been inclined to overdo it and study 24/7.  I have an entire year of studying ahead of me, so I’m steeling myself for a marathon rather than trying to win a sprint through math camp.

Two weeks in the books

Two weeks of UPenn’s math camp are in the books!

We’ve covered a lot of material in the last two weeks, including:

Dedekind cuts, Cauchy sequences, cardinality, continuity, metric spaces, norms, inner products, boundedness, open and closed sets, completeness, compactness, coverings, connectedness, cluster points, and correspondences.

This material is NOT EASY. There isn’t always time for me to fully understand each concept before we have to move on to the next one. In particular, I have difficulty coming up with creative ways of defining sequences and epsilon-delta conditions. If I can solve a problem, it’s usually because I can think through each step logically and convince myself that it must be true. But, I need more practice formulating my logical arguments using specific mathematical language. At the moment, my proofs lack rigor because I don’t define the sets exactly right or I don’t specify the right radius for my open epsilon-ball. I’m hoping that stuff will come with time.
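For anyone unfamiliar, this is the kind of statement I mean; it is the standard epsilon-delta definition of continuity at a point:

```latex
% A function f : \mathbb{R} \to \mathbb{R} is continuous at a point c if
\forall \varepsilon > 0 \;\; \exists \delta > 0 \;\; \text{such that} \;\;
|x - c| < \delta \implies |f(x) - f(c)| < \varepsilon.
```

The hard part in practice is running the definition in reverse: given the epsilon someone hands you, constructing a delta (or a set, or a radius for an open ball) that makes the implication go through.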

There are close to 40 students in math camp between the UPenn Economics, Wharton Finance, and Wharton Applied Economics programs. All but three of them have taken a course in Real Analysis before (I’m one of the three!), and many have taken courses in Topology. About half of the cohort has a serious math background, by which I mean that they have at least an undergraduate degree in math, and many have graduate degrees in math or statistics.

This all means that I’m not as well prepared for math camp as many of my colleagues. However, I benefited enormously from reading Velleman’s How to Prove It before I got here. From the very first day of class, it was clear that reading the book (and working through the exercises) had greatly enhanced my ability to comprehend the structure of the proofs I was reading and to break down my own proofs into intermediate goals. The book also helped by giving me plenty of practice with the contradiction and induction proofs which form the meat and potatoes of our course thus far.

I actually can’t recommend How to Prove It enough. I think it should form the basis of a required course for first-year college students in all majors. It teaches you how to logically order arguments. People who go on to do a major that requires a lot of writing would probably benefit indirectly from the training that the book provides. How to Prove It is especially valuable, though, if you plan on doing some graduate work involving mathematics.

That’s all for now. I’ve got to get back to studying.

It begins

Tomorrow is my first day of math camp. I am starting a PhD in economics at the University of Pennsylvania in the fall, and the math camp is a “highly-recommended” program to prepare first-year PhD students for the rigors of the upcoming school year. Just a year ago, I was moving from Oslo, Norway to London. I had lived in Oslo for almost five years. I had been a freelance musician, playing with the Oslo Philharmonic, the Norwegian Radio Orchestra, the National Opera Orchestra, and other symphony orchestras in Norway. I had also played in a trio that gave performances and was active in commissioning new music. It is not easy to provide a satisfying answer to the question of why I left the world of music to pursue economics. I certainly never lost my love of classical music. Although it was easy to find some aspects of the classical music “industry” that I didn’t like, I didn’t alter my life trajectory over such quibbles. I think the main reason that I switched is that I wanted to apply my mind to a different set of problems than the ones I encountered in music.

So, one year ago, I made an abrupt change and moved to London to begin the one-year MSc program in economics at the London School of Economics. I had no idea that LSE would force me to spend nearly every waking hour of every day studying, and that I would still feel clueless most of the time. Yet, I found myself mostly enjoying the program which had been memorably described as “not for the faint of heart.” I did not want my study of economics to finish after the one-year masters was over, so a PhD seemed the logical next step.

Since the time I switched from music to economics, I’ve always been anxious about my mathematical background. I have a good foundation in linear algebra and a decent one in multivariate calculus from high school, but I took little math in college, and I’ve never taken any course such as real analysis that focused on proof-writing. I spent the months before I began LSE studying Simon and Blume’s Mathematics for Economists. This helped me to rebuild the mathematical skills that had atrophied after five years of playing music in Norway. After the one-year masters at LSE ended on May 30th, I had a month of free time before UPenn began. I decided I would read How to Prove It by Daniel J. Velleman. It develops the principles of mathematical logic and explains in very readable prose the techniques that mathematicians use to prove theorems. There are tons of practice proofs involving set theory. I felt I learned a huge amount from reading this book, but I will have to wait and see if it helps me survive math camp!

My goal is to update this blog every two weeks. I intend to write about my experiences as a PhD student at UPenn. As I learn more, I may write a few advice columns for prospective applicants about what it seems that admissions offices are after. Eventually, I may even take economics questions and try to answer them. My main goal for the next year, however, is to pass my prelims. Wish me luck!