This paper develops the Parabel algorithm for extreme multi-label learning where the objective is to learn classifiers that can annotate each data point with the most relevant subset of labels from an extremely large label set. State-of-the-art 1-vs-All approaches, which learn a separate classifier per label, have yielded significantly higher prediction accuracies as compared to leading tree and embedding based methods. Unfortunately, 1-vs-All approaches have training and prediction costs that are linear in the number of labels making them prohibitively expensive for real-world applications such as Dynamic Search Advertising (DSA). Parabel addresses this limitation by: (a) efficiently learning a balanced label hierarchy from training data; (b) generalizing the popular hierarchical softmax model to the multi-label setting so as to obtain a probabilistic model of the joint label distribution given the learnt hierarchy and (c) developing logarithmic time training and prediction algorithms based on the proposed model. This allows Parabel to be up to 600-900x faster at training and up to 60-13,000x faster at prediction as compared to leading 1-vs-All approaches while maintaining classification accuracy. Experiments also revealed that Parabel could efficiently scale to DSA problems involving 7 million labels where it significantly increased the ad-recall and clicks when added to the system in production on the Bing search engine. Parabel's source code can be downloaded from The Extreme Classification Repository.