It is standard to use income as the most relevant measure of well-being. However, to obtain a measure of income that is comparable over time, nominal income at each point in time must be deflated by a price series, most commonly the consumer price index (CPI). In the case of Argentina, the index used is the one corresponding to the city of Buenos Aires and its metropolitan area. This is a Laspeyres-type index, with a fixed basket, and it is subject to a series of well-known biases.^{Footnote 5}

First, these indexes overestimate inflation, because they omit the effect of substitution between goods, changes in quality of the goods, and the impact of the availability of new products. Second, the use of a common price index may be a problem when building measures of income distribution, because it assumes that baskets are equivalent across all income groups.

In Argentina, consumption surveys are infrequent. The last three were conducted in 1984–1985, 1996–1997, and 2004–2005, and were undertaken to update the CPI basket. However, the long gaps between updates may lead to significant biases, particularly given the large structural changes undergone by the Argentine economy over the last 25 years (e.g., a large trade liberalization process).^{Footnote 6} Thus, correcting for the biases in the CPI can change the measured evolution of real income, and correcting for the biases at different income levels can also change the measured evolution of income distribution during this period.^{Footnote 7}

These consumption surveys can be used to estimate the biases following the methodology of Costa (2001) and Hamilton (2001). In a nutshell, the methodology uses the assumption that Engel curves for food should be relatively stable. If this is the case, when the estimation of the Engel curves at different dates shows shifts, it is assumed that these correspond to CPI bias. To illustrate the point, consider two points in time between which the share of food in income declines with a stagnant earnings level. Under the assumption that the Engel curve is stable, this provides a presumption that CPI may be biased (overestimated in this case) as a falling income share is consistent with *rising*, not stagnant, income levels. Thus, the changes in the share, with some assumptions, may be linked to the CPI bias. Of course, the biases in the Engel curve are obtained after correcting for changes in relative prices and household characteristics.
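The intuition can be made concrete with a small numerical sketch. It assumes a log-linear Engel curve and uses a hypothetical slope and hypothetical food shares; none of these numbers come from the surveys.

```python
import math

# Hypothetical values for illustration only (not estimates from the surveys).
beta = -0.10            # assumed slope of the food share with respect to log real income
share_start = 0.40      # food share in the first period
share_end = 0.36        # food share in the second period
measured_growth = 0.0   # log change in CPI-deflated income (stagnant)

# With a stable Engel curve, the fall in the food share implies this log change
# in true real income.
implied_growth = (share_end - share_start) / beta

# The gap between measured and implied growth is attributed to CPI mismeasurement.
log_error = measured_growth - implied_growth   # ln(1 + E), negative if the CPI overstates inflation
cumulative_bias = 1.0 - math.exp(log_error)    # -E

print(f"implied log real income growth: {implied_growth:.3f}")
print(f"cumulative CPI bias: {cumulative_bias:.1%}")
```

In this sketch the food share falls while deflated income is flat, so the whole implied rise in real income is read as an overstatement of inflation by the CPI.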

In later work, Carvalho Filho and Chamon (2012) use semi-parametric models to extend the methodology and estimate the biases at different income levels, which makes it possible to address the issue of income distribution.

We should clarify that in the previous work, identification was built from exploiting the differences across regions. In the case of Argentina, however, our data contained only one area (the metropolitan area of the city of Buenos Aires). Thus, our paper needs to innovate from a methodological point of view relative to the previous work, by finding a way to obtain identification when only data from one region are available, something we do using individual price indexes by household.

### Estimating CPI biases

Following Costa (2001), the estimation strategy starts formally from the following equation:

$$w_{ijt} = \varphi + \gamma \left( {\ln P_{Fjt} - \ln P_{Njt} } \right) + \beta \left( {\ln Y_{ijt} - \ln P_{Gjt} } \right) + \sum\limits_{x} {\theta_{x} X_{ijt} } + \mu_{ijt}$$

(1)

where \(w_{ijt}\) is the ratio of food to non-food of household *i*, in region *j* at time *t*; \(P_{Fjt}\) is the true and unobservable price of food in region *j* at time *t*; \(P_{Njt}\) is the true and unobservable price of non-food in region *j* at time *t*; \(Y_{ijt}\) is nominal income for household *i*, in region *j* at time *t*; \(P_{Gjt}\) is the true and unobservable general price level in region *j* at time *t*; \(X_{ijt}\) is a set of control variables for household *i*, in region *j* at time *t*; \(\mu_{ijt}\) is a random term; and \(\varphi\), \(\gamma\), \(\beta\), and the different \(\theta_{x}\) are parameters.

Let \(\prod_{Gjt}\) be the cumulative percentage growth of the observable CPI in region *j* between time 0 and time *t*; \(\prod_{Fjt}\) the cumulative percentage growth of the price of food in region *j* between time 0 and time *t*; \(\prod_{Njt}\) the cumulative percentage growth of the price of non-food in region *j* between time 0 and time *t*; \(E_{Gjt}\) the cumulative percentage increase in the measurement error in the CPI in region *j* between time 0 and time *t*; \(E_{Fjt}\) the cumulative percentage increase in the measurement error in the price of food in region *j* between time 0 and time *t*; and \(E_{Njt}\) the cumulative percentage increase in the measurement error in the price of non-food in region *j* between time 0 and time *t*. We can then rewrite (1) as

$$\begin{aligned} w_{ijt} &= \varphi + \gamma \left[ {\ln \left( {1 + \prod_{Fjt} } \right) - \ln \left( {1 + \prod_{Njt} } \right)} \right] + \beta \left[ {\ln Y_{ijt} - \ln \left( {1 + \prod_{Gjt} } \right)} \right] \hfill \\ &\quad + \gamma \left[ {\ln P_{Fj0} - \ln P_{Nj0} } \right] - \beta \ln P_{Gj0} + \gamma \left[ {\ln \left( {1 + E_{Fjt} } \right) - \ln \left( {1 + E_{Njt} } \right)} \right] \hfill \\&\quad - \beta \ln \left( {1 + E_{Gjt} } \right) + \sum\limits_{x} {\theta_{x} X_{ijt} } + \mu_{ijt} . \hfill \\ \end{aligned}$$

(2)
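The step from (1) to (2) relies on a decomposition that is implied by the definitions above rather than stated explicitly: each true price level is taken to equal its base-period level times the cumulative measured growth times the cumulative growth of the measurement error. For the general price level, this reads

$$\ln P_{Gjt} = \ln P_{Gj0} + \ln \left( {1 + \prod_{Gjt} } \right) + \ln \left( {1 + E_{Gjt} } \right),$$

and analogously for \(P_{Fjt}\) and \(P_{Njt}\). Substituting the three decompositions into (1) and collecting terms gives (2).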

If we assume that the mismeasurement does not change across regions, we can rewrite (2) as

$$\begin{aligned} w_{ijt} & = \varphi + \gamma \left[ {\ln \left( {1 + \prod_{Fjt} } \right) - \ln \left( {1 + \prod_{Njt} } \right)} \right] + \beta \left[ {\ln Y_{ijt} - \ln \left( {1 + \prod_{Gjt} } \right)} \right] \hfill \\ &\quad+ \sum\limits_{j} {\delta_{j} D_{j} } + \sum\limits_{t} {\delta_{t} D_{t} } + \sum\limits_{x} {\theta_{x} X_{ijt} } + \mu_{ijt} \hfill \\ \end{aligned}$$

(3)

where \(D_{j}\) and \(D_{t}\) are region and period dummies, and

$$\delta_{j} = \gamma \left( {\ln P_{Fj0} - \ln P_{Nj0} } \right) - \beta \ln P_{Gj0}$$

(4)

$$\delta_{t} = \gamma \left[ {\ln \left( {1 + E_{Ft} } \right) - \ln \left( {1 + E_{Nt} } \right)} \right] - \beta \ln \left( {1 + E_{Gt} } \right).$$

(5)

Notice that \(\delta_{t}\) is a function only of time. If we additionally assume that the biases for food and non-food items are similar, we can compute a measure of the general CPI bias from

$$\ln \left( {1 + E_{Gt} } \right) = - \frac{{\delta_{t} }}{\beta }.$$

(6)

From (6), we can compute \(E_{Gt} = e^{ - \delta_{t} /\beta } - 1\), which is the cumulative measurement error between true inflation and CPI inflation; \(-E_{Gt}\) is the cumulative bias.

The assumption that the biases for food and non-food are the same is not necessarily very realistic. However, under reasonable assumptions, our measure can be considered a lower bound for the bias. From (5),

$$\ln \left( {1 + E_{Gt} } \right) = \frac{{\gamma \left[ {\ln \left( {1 + E_{Ft} } \right) - \ln \left( {1 + E_{Nt} } \right)} \right]}}{\beta } - \frac{{\delta_{t} }}{\beta }.$$

(7)

If food is a basic good with an income elasticity less than one (*β* < 0), if the income effect is larger than the substitution effect for food consumption (*γ* < 0),^{Footnote 8} and under the reasonable assumption that the mismeasurement in non-food is larger than in food products, the first term in (7) is negative and our bias can be considered a lower bound. In other words, our measure would underestimate the bias in the CPI.

So far, we have described the estimation methodology used in the previous works. However, due to data limitations, we need to introduce some changes in the estimation procedure. Argentina has relatively few consumption expenditure surveys that are publicly available and, as mentioned above, we only had access to the Survey of Household Expenditures of 1985/1986 (Encuesta de Gasto de los Hogares 1985/86, EGH85/86), the National Survey of Household Expenditures 1996/1997 (Encuesta Nacional de Gasto de los Hogares 1996/97, ENGH 96/97), and the National Survey of Household Expenditures 2004/2005 (Encuesta Nacional de Gasto de los Hogares 2004/05, ENGH 04/05). The EGH 85/86 took place in the city of Buenos Aires and its metropolitan area. For the ENGH 2004/05, we only have data for the city of Buenos Aires.

As a result, our data include only two regions, and thus, Eq. (3) becomes

$$\begin{aligned} w_{ijt}& = \varphi + \gamma \left[ {\ln \left( {1 + \prod_{Fjt} } \right) - \ln \left( {1 + \prod_{Njt} } \right)} \right] + \beta \left[ {\ln Y_{it} - \ln \left( {1 + \prod_{Gt} } \right)} \right] \hfill \\&\quad + \delta_{j} D_{j} + \sum\limits_{t} {\delta_{t} D_{t} } + \sum\limits_{x} {\theta_{x} X_{ijt} } + \mu_{ijt} \hfill \\ \end{aligned}$$

(8)

where \(D_{j}\) equals one for households belonging to the city of Buenos Aires.

In the literature, identification is obtained from regional variation, and thus, \(P_{Fjt}\) is the food price in region *j*, and \(P_{Gjt}\) is the general price index in region *j*. This gives several observations for each moment in time, which allows the coefficient on the time dummy to be estimated. Unfortunately, we cannot follow this procedure here, because we only have price indexes for the entire sample (Buenos Aires and its metropolitan area). Even if we had regional price indexes, those of only two neighboring regions would clearly not be enough to identify the relative price effect and the time dummy.

Fortunately, while the specification assumes two types of goods, food and non-food, in reality, there are many goods within each of those categories. In the data, it is not feasible to compute a family specific food price index, but this is feasible for the non-food bundle. Thus, we construct a relative price between the food and non-food baskets at the household level. More precisely, we have that

$$P_{Nit} = \sum\limits_{k} {\lambda_{ik} P_{kt} }$$

(10)

where \(\lambda_{ik}\) is the ratio of expenditure on item *k* to overall spending on non-food items, for household *i* at time *t*.
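As a minimal sketch of Eq. (10), the function below weights item-level price indexes by each household's own non-food expenditure shares. The item names, prices, and spending amounts are hypothetical, and the price indexes are assumed to be normalized to 1 at time 0.

```python
def household_nonfood_index(spending, prices):
    """Eq. (10): sum over items k of lambda_ik * P_kt.

    spending: household's non-food expenditure by item (any currency unit).
    prices:   item-level price index at time t (assumed equal to 1 at time 0).
    """
    total = sum(spending.values())
    return sum(amount / total * prices[item] for item, amount in spending.items())


# Hypothetical item-level price indexes at time t.
prices_t = {"clothing": 2.0, "transport": 3.0, "housing": 2.5}

# Two hypothetical households with different non-food baskets.
household_a = {"clothing": 50.0, "transport": 50.0}
household_b = {"clothing": 10.0, "transport": 30.0, "housing": 60.0}

index_a = household_nonfood_index(household_a, prices_t)
index_b = household_nonfood_index(household_b, prices_t)
print(index_a, index_b)
```

Because the two baskets differ, the two households face different non-food price indexes even though they live in the same region, and it is this household-level variation that the identification strategy exploits.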

Considering that \(\lambda_{ik}\) can be estimated from the individual data in the surveys, we can now rewrite (3) as

$$\begin{aligned} w_{ijt} &= \varphi + \gamma \left[ {\ln \left( {1 + \prod_{Ft} } \right) - \ln \left( {1 + \prod_{Nit} } \right)} \right] + \beta \left[ {\ln Y_{it} - \ln \left( {1 + \prod_{Gt} } \right)} \right] \hfill \\ &\quad+ \delta_{j} D_{j} + \sum\limits_{t} {\delta_{t} D_{t} } + \sum\limits_{x} {\theta_{x} X_{ijt} } + \mu_{ijt} \hfill \\ \end{aligned}$$

(11)

where (\(\prod_{Nit}\)) is the cumulative percentage growth of the price of non-food between time 0 and time *t* at the household level. This equation provides the estimates, as shown in Table 3.
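The logic of Eq. (11) can be illustrated on simulated data. The sketch below is not the estimation code behind Table 3: all parameter values and data are hypothetical, there is no noise, the region dummy and the controls are omitted, and the food and non-food biases are assumed equal so that Eq. (6) applies. The small `ols` helper is written out only to keep the example self-contained.

```python
import math

def ols(X, y):
    """Least squares via the normal equations, solved by Gauss-Jordan elimination."""
    k = len(X[0])
    A = [[sum(row[a] * row[b] for row in X) for b in range(k)]
         + [sum(row[a] * yi for row, yi in zip(X, y))]
         for a in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                m = A[r][c] / A[c][c]
                A[r] = [u - m * v for u, v in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical parameters (two periods, one region).
phi, gamma, beta = 0.9, 0.04, -0.10
E_true = -0.30               # true cumulative measurement error at t = 1
pi_G = {0: 0.0, 1: 1.5}      # cumulative CPI growth
pi_F = {0: 0.0, 1: 1.4}      # cumulative food price growth
# (period, nominal income, household-level cumulative non-food price growth)
households = [(0, 100.0, 0.0), (0, 200.0, 0.0), (0, 400.0, 0.0), (0, 800.0, 0.0),
              (1, 250.0, 1.45), (1, 500.0, 1.60), (1, 1000.0, 1.50), (1, 2000.0, 1.70)]

X, w = [], []
for t, income, pi_N in households:
    rel_price = math.log1p(pi_F[t]) - math.log1p(pi_N)
    real_income = math.log(income) - math.log1p(pi_G[t])
    d1 = 1.0 if t == 1 else 0.0
    X.append([1.0, rel_price, real_income, d1])
    # Simulated shares respond to true real income, which differs from measured
    # real income by ln(1 + E_true) in period 1.
    w.append(phi + gamma * rel_price + beta * (real_income - d1 * math.log1p(E_true)))

phi_hat, gamma_hat, beta_hat, delta_hat = ols(X, w)
E_hat = math.exp(-delta_hat / beta_hat) - 1.0   # Eq. (6)
print(f"gamma={gamma_hat:.3f} beta={beta_hat:.3f} E={E_hat:.3f} bias={-E_hat:.1%}")
```

In this simulation the relative price varies across households within period 1, which is what allows the relative price coefficient and the time dummy to be separated with data from a single region.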

A consequence of this strategy, however, is that the price index estimated at the household level may be correlated with the error term of the equation and may pose an endogeneity problem, for example, if this price level is correlated with the taste for food. To deal with this problem, an alternative is to assign an arbitrary value to *γ* and then use \(w_{ijt} - \gamma \left[ {\ln \left( {1 + \prod_{Ft} } \right) - \ln \left( {1 + \prod_{Nt} } \right)} \right]\) as the dependent variable to estimate the bias. This circumvents the need to use the individual price level altogether. However, where should this coefficient come from? If we use the coefficient estimated in Eq. (1) from Table 3 (0.038), the total cumulative bias reaches 59.5%, which is very similar to the 61% from Table 3. A better option is to use an exogenous measure of this coefficient. Costa (2001) obtains a coefficient of 0.046 for the United States when identifying the effect of relative prices from differences across regions. Repeating the exercise with 0.046, the cumulative bias reaches 59.4%. Using twice the coefficient for the United States (0.092), the cumulative bias reaches 58.9%. The main reason why changes in the *γ* coefficient do not significantly alter the results is that relative prices have not changed much. Figure 2 shows the evolution of the relative price of food in terms of the general price level between 1985 and 2005.

Because the price of food relative to the CPI fell about 10% between the first and second surveys, and only 4% between the first and the third, the coefficient would have to be extremely large to significantly alter the results. For example, to reduce the cumulative bias by half (i.e., to about 30%), the coefficient would have to be more than 40 times the estimated coefficient for the United States. In short, our results appear to be extremely robust, independently of the methodology adopted.

Trebon (2008) has suggested that economies of scale in each household may affect the share of food to non-food and suggests a correction based on introducing the household size interacted with the time dummies (that identify the bias). In other words, he suggests estimating

$$\begin{aligned} w_{ijt} &= \varphi + \gamma \left[ {\ln \left( {1 + \prod_{Ft} } \right) - \ln \left( {1 + \prod_{Nit} } \right)} \right] + \beta \left[ {\ln Y^{pc}_{it} - \ln \left( {1 + \prod_{Gt} } \right)} \right] \hfill \\ &\quad+ \delta_{j} D_{j} + \sum\limits_{t} {\delta_{t} D_{t} } + \sum\limits_{t} {\psi_{t} (D_{t} *hhsize)} + \sum\limits_{x} {\theta_{x} X_{ijt} } + \mu_{ijt} . \hfill \\ \end{aligned}$$

(12)

While Trebon finds that this correction reduced CPI biases by as much as a half relative to the findings in Costa (2001) and Hamilton (2001) for the US, Sect. 3 shows that in our case, this correction does not change things.

### Income distribution effects

Following Carvalho Filho and Chamon (2012), we also explore the possibility that the size of the bias may change along the Engel curve, which allows us to estimate different mismeasurements in earnings growth for different income levels. Using a semi-parametric specification and assuming, as before, that the biases are the same for the food and non-food bundles, we have that

$$\begin{aligned} w_{ijt} &= \varphi + \gamma \left[ {\ln \left( {1 + \prod_{Ft} } \right) - \ln \left( {1 + \prod_{Nit} } \right)} \right] \hfill \\ &\quad+ f_{t} \left[ {\ln Y_{it} - \ln \left( {1 + \prod_{Gt} } \right) - \ln \left( {1 + E_{Git} } \right)} \right] + \sum\limits_{x} {\theta_{x} X_{ijt} } + \mu_{ijt} . \hfill \\ \end{aligned}$$

(13)

The function \(f_{t} \left[ {\ln Y_{it} - \ln \left( {1 + \prod_{Gt} } \right) - \ln \left( {1 + E_{Git} } \right)} \right]\) may be estimated non-parametrically using the differencing method of Yatchew (1997). To apply this method, we sort observations by income. The difference between two consecutive observations can be written as

$$\begin{aligned}& w_{ijt} - w_{i - 1jt} = \varphi + \gamma \left\{ {\left[ {\ln \left( {1 + \prod_{Ft} } \right) - \ln \left( {1 + \prod_{Nit} } \right)} \right] - \left[ {\ln \left( {1 + \prod_{Ft} } \right) - \ln \left( {1 + \prod_{Ni - 1t} } \right)} \right]} \right\} \hfill \\&\quad + f_{t} \left[ {\ln Y_{it} - \ln \left( {1 + \prod_{Gt} } \right) - \ln \left( {1 + E_{Git} } \right)} \right] - f_{t} \left[ {\ln Y_{i - 1t} - \ln \left( {1 + \prod_{Gt} } \right) - \ln \left( {1 + E_{Gi - 1t} } \right)} \right] \hfill \\&\quad + \sum\limits_{x} {\theta_{x} \left( {X_{ijt} - X_{i - 1jt} } \right)} + \mu_{ijt} - \mu_{i - 1jt} . \hfill \\ \end{aligned}$$

(14)

Because observations are sorted by income, adjacent incomes are very similar, so

$$\ln Y_{it} - \ln \left( {1 + \prod_{Gt} } \right) - \ln \left( {1 + E_{Git} } \right) \cong \ln Y_{i - 1t} - \ln \left( {1 + \prod_{Gt} } \right) - \ln \left( {1 + E_{Gi - 1t} } \right).$$

(15)

Assuming that \(f_{t}\) is a smooth function:

$$f_{t} \left[ {\ln Y_{it} - \ln \left( {1 + \prod_{Gt} } \right) - \ln \left( {1 + E_{Git} } \right)} \right] \cong f_{t} \left[ {\ln Y_{i - 1t} - \ln \left( {1 + \prod_{Gt} } \right) - \ln \left( {1 + E_{Gi - 1t} } \right)} \right].$$

(16)

Therefore, Eq. (14) becomes

$$\begin{aligned} w_{ijt} - w_{i - 1jt}& = \varphi + \gamma \left\{ {\left[ {\ln \left( {1 + \prod_{Ft} } \right) - \ln \left( {1 + \prod_{Nit} } \right)} \right] - \left[ {\ln \left( {1 + \prod_{Ft} } \right) - \ln \left( {1 + \prod_{Ni - 1t} } \right)} \right]} \right\} \hfill \\&\quad + \sum\limits_{x} {\theta_{x} \left( {X_{ijt} - X_{i - 1jt} } \right)} + \mu_{ijt} - \mu_{i - 1jt} . \hfill \\ \end{aligned}$$

(17)

Note that Eq. (17) is a linear function [with coefficients identical to those of (13)], so that we can consistently estimate it by OLS and construct the linear part of the prediction of \(w_{ijt}\), called \(\hat{w}_{ijt}\), to arrive at

$$w_{ijt} - \hat{w}_{ijt} = f_{t} \left[ {\ln Y_{it} - \ln \left( {1 + \prod_{Gt} } \right) - \ln \left( {1 + E_{Git} } \right)} \right] + \mu_{ijt} .$$

(18)
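The differencing steps leading to Eq. (18) can be sketched on simulated data. Everything below is hypothetical: a single linear control `z` stands in for the relative price term and the controls, the Engel curve `f` is an arbitrary smooth function, and there is no noise.

```python
# Hypothetical data: n households, one linear control z and a smooth Engel curve f.
n, theta = 200, 0.01
f = lambda x: 1.0 - 0.15 * x + 0.008 * x ** 2
log_income = [4.0 + 3.0 * i / (n - 1) for i in range(n)]
z = [float((7 * i) % 5) for i in range(n)]
w = [f(x) + theta * zi for x, zi in zip(log_income, z)]

# Step 1: sort observations by income.
order = sorted(range(n), key=lambda i: log_income[i])
x_s = [log_income[i] for i in order]
w_s = [w[i] for i in order]
z_s = [z[i] for i in order]

# Step 2: difference adjacent observations; the smooth function nearly cancels.
dw = [w_s[i] - w_s[i - 1] for i in range(1, n)]
dz = [z_s[i] - z_s[i - 1] for i in range(1, n)]

# Step 3: OLS of the differenced share on a constant and the differenced control,
# in the spirit of Eq. (17).
mw, mz = sum(dw) / len(dw), sum(dz) / len(dz)
theta_hat = (sum((a - mz) * (b - mw) for a, b in zip(dz, dw))
             / sum((a - mz) ** 2 for a in dz))

# Step 4: subtract the estimated linear part; the remainder approximates f plus
# noise, as in Eq. (18), and can be passed to a non-parametric smoother.
residual = [wi - theta_hat * zi for wi, zi in zip(w_s, z_s)]
print(theta_hat)
```

In this sketch the remainder from step 4 plays the role of the left-hand side of Eq. (18).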

Taking the left-hand side of Eq. (18) as the dependent variable, we can estimate Eq. (18) by any common non-parametric method; we choose local weighted regression. After estimating \(\hat{f}_{t}\), the cumulative bias may then be computed as the value of \(E_{Git}\) that solves, for each household *i* at time *t*, the following equation:

$$\hat{f}_{t} \left[ {\ln Y_{it} - \ln \left( {1 + \prod_{Gt} } \right) - \ln \left( {1 + E_{Git} } \right)} \right] = \hat{f}_{0} \left[ {\ln Y_{it} - \ln \left( {1 + \prod_{Gt} } \right)} \right].$$

(19)

Intuitively, if the function *f* is constant over time, the value of *f* for a given income level must be the same regardless of the time period used for its estimation.
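For reference, a local weighted regression of the kind mentioned above can be sketched as a local linear fit. The text does not specify the kernel or the bandwidth, so the tricube weights and the `frac` parameter below are assumptions, and the data are hypothetical. The function assumes at least three distinct values of `x`.

```python
def local_linear(x, y, x0, frac=0.3):
    """Tricube-weighted local linear fit of y on x, evaluated at x0."""
    k = max(3, int(frac * len(x)))
    bandwidth = sorted(abs(xi - x0) for xi in x)[k - 1]  # distance to the k-th nearest point
    s0 = s1 = s2 = t0 = t1 = 0.0
    for xi, yi in zip(x, y):
        d = xi - x0
        u = abs(d) / bandwidth
        if u < 1.0:
            wgt = (1.0 - u ** 3) ** 3
            s0 += wgt
            s1 += wgt * d
            s2 += wgt * d * d
            t0 += wgt * yi
            t1 += wgt * d * yi
    # The intercept of the weighted regression of y on (x - x0) is the fit at x0.
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1)


# Hypothetical example: a noiseless linear Engel curve on a grid of log incomes.
xs = [4.0 + 0.05 * i for i in range(61)]
ys = [0.9 - 0.1 * v for v in xs]
fitted = local_linear(xs, ys, 5.5)
print(fitted)
```

Evaluating such a smoother at each household's measured log real income, separately for each survey, would give the values of \(\hat{f}_{0}\) and \(\hat{f}_{t}\) that are compared across periods.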

To estimate the cumulative bias for households at time *t*, we went through the following steps. First, for each household at time *t*, we selected the households at time 0 whose \(\hat{f}_{0}\) was nearest to the value estimated for that household (that is, \(\hat{f}_{t}\)). Specifically, we selected two households at time 0: those whose \(\hat{f}\) was immediately higher and immediately lower. Second, we computed the difference in real income between the two selected households. Third, we distributed this difference linearly across the households from time *t* whose \(\hat{f}\) fell between the lower and upper bounds selected above from households at time 0. Fourth, we computed the real income that each household at time *t* should have given its share of food, by adding the corresponding portion of the difference computed before to the income of the lower-bound household (in terms of \(\hat{f}\)). Fifth, we computed the bias for household *i* at time *t* from its measured real income at time *t* and the real income that it should have given its share of food. More precisely, we compute

$$E_{Git} = \exp \left[ {\ln Y_{it} - \ln \left( {1 + \prod_{Gt} } \right) - \left[ {\ln Y_{i0}^{{\hat{f}_{0}^{1} }} + \frac{{\left( {\ln Y_{i0}^{{\hat{f}_{0}^{2} }} - \ln Y_{i0}^{{\hat{f}_{0}^{1} }} } \right)}}{H}*h} \right]} \right] - 1.$$

(20)

Here, \(Y_{i0}^{{\hat{f}_{0}^{1} }}\) is the income of the time-0 household whose \(\hat{f}_{0}\) is closest from below to that of household *i* at time *t*, \(Y_{i0}^{{\hat{f}_{0}^{2} }}\) is the income of the time-0 household whose \(\hat{f}_{0}\) is closest from above, *H* is the number of households at time *t* that have an \(\hat{f}_{t}\) between \(\hat{f}_{0}^{1}\) and \(\hat{f}_{0}^{2}\), and \(h = 1 \ldots H\) is the rank of these households sorted by \(\hat{f}\).
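The matching and interpolation steps behind Eq. (20) can be sketched as follows. The function name, the input layout, and the simulated data are hypothetical; the fitted values \(\hat{f}\) are assumed to have been estimated already, and incomes are assumed to be in logs of CPI-deflated income.

```python
import bisect
import math
from collections import defaultdict

def household_bias(base, current):
    """Eq. (20) for each time-t household.

    base:    list of (f0_hat, log real income) for time-0 households.
    current: list of (ft_hat, log measured real income) for time-t households.
    Returns {index in current: E_Git}; households whose ft_hat lies outside the
    range of f0_hat are skipped.
    """
    base = sorted(base)                      # ascending in f0_hat
    f0 = [b[0] for b in base]
    brackets = defaultdict(list)             # bracket position -> time-t households
    for idx, (ft, _) in enumerate(current):
        pos = bisect.bisect_left(f0, ft)     # f0[pos - 1] < ft <= f0[pos]
        if 0 < pos < len(f0):
            brackets[pos].append((ft, idx))
    result = {}
    for pos, members in brackets.items():
        members.sort()                       # rank h = 1..H by f_hat
        H = len(members)
        y_low, y_high = base[pos - 1][1], base[pos][1]
        for h, (_, idx) in enumerate(members, start=1):
            implied = y_low + (y_high - y_low) / H * h
            result[idx] = math.exp(current[idx][1] - implied) - 1.0
    return result


# Hypothetical check: a common Engel curve and a true error of -30% at time t.
f = lambda x: 1.0 - 0.1 * x
base = [(f(4.0 + 0.01 * k), 4.0 + 0.01 * k) for k in range(401)]
E_true = -0.30
current = [(f(m - math.log1p(E_true)), m) for m in (5.0, 5.5, 6.0)]
E_hat = household_bias(base, current)
print(E_hat)
```

With a dense grid of time-0 households, the recovered errors are close to the simulated one; the approximation error is bounded by the income gap between the two bracketing households.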