A Three-Parameter Rank-Frequency Relation in Natural Languages

ACL 2020 · Chenchen Ding, Masao Utiyama, Eiichiro Sumita

We show that the rank-frequency relation in textual data follows $f \propto r^{-\alpha}(r+\gamma)^{-\beta}$, where $f$ is the token frequency, $r$ is the rank by frequency, and ($\alpha$, $\beta$, $\gamma$) are parameters. The formulation is derived from the empirical observation that $d^2 (x+y)/dx^2$ is a typical impulse function, where $(x,y)=(\log r, \log f)$. It reduces to the power law when $\beta=0$ and to the Zipf-Mandelbrot law when $\alpha=0$. Based on an investigation of multilingual corpora, we illustrate that $\alpha$ is related to the analytic features of syntax and $\beta+\gamma$ to those of morphology in natural languages.
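
To make the formula and its two special cases concrete, the following is a minimal sketch that evaluates $f \propto r^{-\alpha}(r+\gamma)^{-\beta}$ over a finite range of ranks. It is not the authors' code: the function names, the normalization over a fixed number of ranks, and all parameter values are assumptions chosen only for illustration.

```python
import math


def log_frequency(rank, alpha, beta, gamma, log_c=0.0):
    """Return log f = log_c - alpha*log(r) - beta*log(r + gamma).

    log_c is a hypothetical constant standing in for the proportionality
    factor, which the relation itself leaves unspecified.
    """
    if rank < 1:
        raise ValueError("rank must be >= 1")
    return log_c - alpha * math.log(rank) - beta * math.log(rank + gamma)


def relative_frequencies(num_ranks, alpha, beta, gamma):
    """Evaluate the relation for ranks 1..num_ranks and normalize to sum to 1."""
    weights = [
        math.exp(log_frequency(r, alpha, beta, gamma))
        for r in range(1, num_ranks + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative parameter values only; they are not taken from the paper.
three_param = relative_frequencies(1000, alpha=0.5, beta=0.7, gamma=3.0)

# beta = 0 removes the (r + gamma) factor: a pure power law f ~ r^(-alpha).
power_law = relative_frequencies(1000, alpha=1.0, beta=0.0, gamma=3.0)

# alpha = 0 removes the r factor: Zipf-Mandelbrot, f ~ (r + gamma)^(-beta).
zipf_mandelbrot = relative_frequencies(1000, alpha=0.0, beta=1.0, gamma=3.0)
```

Working in log space mirrors the $(x,y)=(\log r, \log f)$ coordinates used in the abstract. In the two special cases, the ratio of the first two frequencies depends only on the surviving factor: $2^{\alpha}$ for the power law and $\left((2+\gamma)/(1+\gamma)\right)^{\beta}$ for the Zipf-Mandelbrot law.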
