A Three-Parameter Rank-Frequency Relation in Natural Languages
We show that the rank-frequency relation in textual data follows $f \propto r^{-\alpha}(r+\gamma)^{-\beta}$, where $f$ is the token frequency, $r$ is the rank by frequency, and ($\alpha$, $\beta$, $\gamma$) are parameters. The formulation is derived from the empirical observation that $d^2 (x+y)/dx^2$ is a typical impulse function, where $(x,y)=(\log r, \log f)$. The formulation reduces to the power law when $\beta=0$ and to the Zipf-Mandelbrot law when $\alpha=0$. Based on an investigation of multilingual corpora, we illustrate that $\alpha$ is related to the analytic features of syntax and $\beta+\gamma$ to those of morphology in natural languages.
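As a minimal illustration of the relation, the Python sketch below evaluates $\log f$ from ($\alpha$, $\beta$, $\gamma$) and the second derivative implied by the formula. The function names, the normalizing constant `log_c`, and the parameter values are assumptions made for this example and are not taken from the paper. Since $d^2x/dx^2 = 0$, differentiating $y = \log C - \alpha x - \beta \log(e^x + \gamma)$ twice gives $d^2(x+y)/dx^2 = -\beta\gamma r/(r+\gamma)^2$, a single localized pulse centered at $r = \gamma$ with magnitude $\beta/4$.

```python
import numpy as np


def log_frequency(rank, alpha, beta, gamma, log_c=0.0):
    """Return log f for f = C * r^(-alpha) * (r + gamma)^(-beta).

    log_c is a hypothetical normalizing constant (log C); the source
    states only the proportionality.
    """
    rank = np.asarray(rank, dtype=float)
    return log_c - alpha * np.log(rank) - beta * np.log(rank + gamma)


def second_derivative(rank, beta, gamma):
    """Closed form of d^2(x + y)/dx^2 with (x, y) = (log r, log f).

    Derived from the formula above: -beta * gamma * r / (r + gamma)^2.
    It does not depend on alpha, because the alpha term is linear in x.
    """
    rank = np.asarray(rank, dtype=float)
    return -beta * gamma * rank / (rank + gamma) ** 2


if __name__ == "__main__":
    ranks = np.array([1.0, 10.0, 100.0, 1000.0])
    # Illustrative parameter values only, not fitted to any corpus.
    alpha, beta, gamma = 1.0, 0.5, 20.0
    print("log f:", log_frequency(ranks, alpha, beta, gamma))
    print("curvature:", second_derivative(ranks, beta, gamma))
```

Setting `beta=0` leaves a straight line of slope $-\alpha$ in log-log coordinates (the power law), and setting `alpha=0` leaves the Zipf-Mandelbrot form. In both special cases of the sketch, the curvature term is the only place where $\gamma$ has an effect.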