I Introduction
Age and gender, two of the key facial attributes, play fundamental roles in social interactions, making age and gender estimation from a single face image an important task in intelligent applications such as access control, human-computer interaction, law enforcement, marketing intelligence and visual surveillance [1].
Over the last decade, most methods used manually-designed features and statistical models [2, 3] to estimate age and gender [4, 5, 6, 7, 8, 9, 10], and they achieved respectable results on benchmarks of constrained images, such as FGNET [11] and MORPH [12]. However, methods based on manually-designed features behave unsatisfactorily on recent benchmarks of unconstrained images, namely "in-the-wild" benchmarks, including Public Figures [13], Gallagher group photos [14], Adience [15] and the apparent age data set LAP [16], because these features cannot cope with large variations in appearance, noise, pose and lighting.
Deep learning, especially deep Convolutional Neural Networks (CNN) [17, 18, 19, 20, 21, 22, 23, 24, 25, 26], has proven itself to be a strong competitor to the more sophisticated and highly tuned methods [27]. Although unconstrained photographic conditions pose various challenges to age and gender prediction in the wild, we can still enjoy great improvements brought by CNNs [28, 29, 30, 35, 1]. The optimization ability of neural networks is critical to the performance of age and gender estimation, while existing CNNs designed for this task only have several layers, which severely limits progress. Therefore, we construct a very deep CNN, Residual networks of Residual networks (RoR) [43], for age group and gender estimation in the wild. To begin with, we construct RoR with different residual block types, and analyze the effects of drop-path, dropout, maximum epoch number, residual block type and depth in order to promote the learning capability of the CNN. In addition, analysis of the characteristics of age estimation suggests two modest mechanisms, a CNN pretrained by gender and a weighted loss layer, to further increase the accuracy of age estimation, as shown in Fig. 1(a). Moreover, in order to further improve performance and alleviate the overfitting problem on the small-scale data set, we first train the RoR model on ImageNet, then fine-tune it on the IMDB-WIKI-101 data set, and finally fine-tune it further on the Adience data set. Fig. 1(b) shows the pipeline of our framework. Through extensive experiments on the Adience data set, our RoR model achieves new state-of-the-art results.

The remainder of the paper is organized as follows. Section II briefly reviews related work on age and gender estimation methods and deep convolutional neural networks. The proposed RoR age and gender estimation method and the two mechanisms are described in Section III. Experimental results and analysis are presented in Section IV, leading to conclusions in Section V.
II Related Work
II-A Age and gender estimation
In the past twenty years, human age and gender estimation from face images has benefited tremendously from the evolutionary development of facial analysis. Early methods for age estimation were based on geometric features calculating ratios between different measurements of facial features [44]. Geometric features can separate babies from adults easily but are unable to distinguish between adults and elderly people. Therefore, Active Appearance Model (AAM) based methods [11] incorporated geometric and texture features to achieve the desired results. However, these pixel-based methods are not suitable for in-the-wild images, which have large variations in pose, illumination, expression, aging, cosmetics and occlusion. After 2007, most existing methods used manually-designed features in this field, such as Gabor [4], LBP [45], SFP [5], and BIF [6]. Based on these manually-designed features, regression and classification methods are used to predict the age or gender of face images. SVM based methods [6, 15] are used for age group and gender classification. For regression, linear regression [7], SVR [8], PLS [9], and CCA [10] are the most popular methods for accurate age prediction. However, all of these methods were only proven effective on constrained benchmarks, and could not achieve respectable results on benchmarks in the wild [46, 15].

Recent research on CNNs showed that a CNN model can learn a compact and discriminative feature representation when the size of the training data is sufficiently large, so an increasing number of researchers have started to use CNNs for age and gender estimation. Yi et al. [28] first proposed a CNN based age and gender estimation method, Multi-Scale CNN. Wang et al. [29] extracted CNN features, and employed different regression and classification methods for age estimation on FGNET and MORPH. Levi et al. [30] used a CNN for age and gender classification on the unconstrained Adience benchmark. Ekmekji [31]
proposed a chained gender-age classification model by training age classifiers on each gender separately. With the development of deeper CNNs, Liu et al. [32] addressed the apparent age estimation problem on the LAP data set by fusing two kinds of models: real-value based regression models and Gaussian label distribution based GoogLeNet. Antipov et al. [33] improved the previous year's results on LAP by fusing a general model and a children model. Huo et al. [34] proposed a novel method called Deep Age Distribution Learning (DADL), which uses a deep CNN model to predict the age distribution. Hou et al. [35] proposed a VGG-16-like model with Smooth Adaptive Activation Functions (SAAF) to predict age groups on the Adience benchmark. They then used the exact squared Earth Mover's Distance (EMD2) [36] in the loss function for CNN training and obtained better age estimation results. A VGG-16 architecture and SVR [37] were used for age estimation on top of the CNN features. The Deep EXpectation (DEX) formulation [1] was proposed for age estimation based on the VGG-16 architecture and a classification followed by an expected value formulation, and it achieved good results on the FGNET, MORPH, Adience and LAP data sets. Iqbal et al. [38] proposed a local face descriptor, Directional Age-Primitive Pattern (DAPP), which inherits discernible aging cue information and achieved higher accuracy on the Adience data set. Recently, Hou et al. used the R-SAAFc2+IMDB-WIKI [39] method, and achieved the state-of-the-art results on the Adience benchmark.

II-B Deep convolutional neural networks
It is widely acknowledged that the performance of CNN based age and gender estimation relies heavily on the optimization ability of the CNN architecture, and deeper and deeper CNNs have been constructed. From the 5conv+3fc AlexNet [17] to the 16conv+3fc VGG networks [21] and the 21conv+1fc GoogLeNet [25], and then to thousand-layer ResNets, both the accuracy and the depth of CNNs increased rapidly. With a dramatic rise in depth, residual networks (ResNets) [26] achieved state-of-the-art performance in the ILSVRC 2015 classification, localization and detection tasks and the COCO detection and segmentation tasks. Then, in order to alleviate the vanishing gradient problem and further improve the performance of ResNets, Identity Mapping ResNets (Pre-ResNets) [47] simplified residual network training with the BN-ReLU-conv order. Huang et al. [48] proposed Stochastic Depth residual networks (SD), which randomly drop a subset of layers and bypass them with shortcut connections for every mini-batch, to alleviate overfitting and reduce the vanishing gradient problem. In order to exploit the optimization ability of the residual network family, Zhang et al. [43] proposed the Residual networks of Residual networks architecture (RoR), which adds shortcuts level by level based on residual networks, and achieved the state-of-the-art results at that time on low-resolution image data sets such as CIFAR-10, CIFAR-100 [49] and SVHN [50]. Instead of sharply increasing the feature map dimension, PyramidNet [40] gradually increases the feature map dimension at all units and achieves superior generalization ability. DenseNet [41] uses densely connected paths to concatenate the input features with the output features, enabling each micro-block to receive raw information from all previous micro-blocks. To enjoy the benefits of both path topologies of ResNets and DenseNet, the Dual Path Network [42] shares common features while maintaining the flexibility to explore new features through dual path architectures.

III Methodology
In this section, we describe the proposed RoR architecture with two modest mechanisms for age group and gender classification. Our methodology is essentially composed of four steps: constructing the RoR architecture to improve the optimization ability of the model; pretraining with gender and training with a weighted loss layer to promote the performance of age group classification; pretraining on ImageNet; and further fine-tuning on the IMDB-WIKI-101 data set to alleviate the overfitting problem and improve the performance of age group and gender classification. In the following, we describe these four main components in detail.
III-A Network architecture
RoR [43] is based on a hypothesis: the residual mapping of a residual mapping is easier to optimize than the original residual mapping. To enhance the optimization ability of residual networks, RoR optimizes the residual mapping of residual mappings by adding shortcuts level by level on top of residual networks. Through experiments, Zhang et al. [43] argued that the optimization ability of Pre-RoR is better than that of RoR with the same number of layers, so we choose Pre-RoR in this paper except when pretraining on ImageNet or IMDB-WIKI.
In order to train on the high-resolution Adience data set, we first construct RoR based on the basic Pre-ResNets for Adience, and denote this kind of RoR as Pre-RoR. Pre-ResNets [47] include two types of residual block designs: the basic residual block and the bottleneck residual block. Fig. 2 shows the Pre-RoR with basic blocks constructed on top of the original Pre-ResNets with basic blocks. The shortcuts in these original residual blocks are denoted as the final-level shortcuts. To start with, we add a shortcut above all basic blocks, and this shortcut is called the root shortcut or first-level shortcut. We use 64, 128, 256 and 512 filters sequentially in the convolutional layers, and each kind of filter appears in a different number of basic blocks (listed in Table II), forming four basic block groups. Furthermore, we add a shortcut above each basic block group, and these four shortcuts are called second-level shortcuts. Then we could continue adding shortcuts as inner-level shortcuts. Lastly, the shortcuts in the basic residual blocks are regarded as the final-level shortcuts. Let m denote the shortcut level number. In this paper, we choose level number m = 3 according to the analysis of Zhang et al. [43], so the RoR has root-level, middle-level and final-level shortcuts, as shown in Fig. 2.
The junctions located at the end of each residual block group can be expressed by the following formulation:

x_{b+1} = h(x_b) + F(x_b, W_b) + g(x_{b'}),   (1)

where x_b and x_{b+1} are the input and output of the b-th block, x_{b'} is the input of the corresponding block group, and F is a residual mapping function. g and h are shortcut mapping functions: g expresses the mapping of the first-level and second-level shortcuts, and h denotes the identity mapping of the final-level shortcuts. When the input and output dimensions differ, g is a Type B projection shortcut.
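To make the junction formulation concrete, here is a minimal numerical sketch (not the paper's Torch implementation): toy scalar functions stand in for the convolutional residual mappings, and the group-level shortcut g and the final-level shortcut h are identity mappings.

```python
def residual_block(x, w):
    """One final-level residual unit: h(x) + F(x, w), with h = identity
    and a toy residual mapping F(x, w) = w * x standing in for conv layers."""
    return x + w * x

def block_group(x, block_weights):
    """A block group whose junction also adds the second-level shortcut g."""
    x_group_in = x                      # carried by the group-level shortcut g
    for w in block_weights[:-1]:
        x = residual_block(x, w)
    w_last = block_weights[-1]
    # junction of Eq. (1): h(x_b) + F(x_b, W_b) + g(x_b')
    return x + w_last * x + x_group_in
```

With `block_group(1.0, [0.1, 0.2])`, the inner block gives 1.1, and the junction sums 1.1 + 0.22 + 1.0 = 2.32: the group input is added once more on top of the ordinary residual sum, which is exactly what the extra shortcut level contributes.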
For the bottleneck block, He et al. [47] used a stack of three layers instead of two, which first reduces the dimensions and then restores them. Basic blocks and bottleneck blocks have similar time complexity, so we can easily obtain deeper networks through bottleneck blocks. In this paper, we also construct a Pre-RoR based on bottleneck Pre-ResNets. The architecture details of Pre-RoR with bottleneck blocks are shown in Fig. 3. We use the expansion factor α to control the output dimensions of the blocks. He et al. [47] chose α = 4, which makes the input and output planes of these shortcuts very different. Since the zero-padding (Type A) shortcut will bring more deviation and the projection (Type B) shortcut will aggravate overfitting, our RoR experiments adopt α = 4, α = 2 and α = 1 in this paper.

III-B Pretraining with gender
Like face recognition, age estimation can easily be affected by many intrinsic and extrinsic factors. Some of the most important factors are identity, gender and ethnicity, together with other factors like Pose, Illumination and Expression (PIE). We can alleviate the effects of these factors by using large data sets in the wild, but the existing data sets for age estimation are generally relatively small. To some extent, gender affects age judgments. On the one hand, the aging process of men differs slightly from that of women due to differences in longevity, hormones, skin thickness, etc. On the other hand, women are more likely to hide their real age by using makeup. So real-world age estimation for men and women is not exactly the same. Guo et al. [10] and Ekmekji [31] first manually separated the data set according to the gender labels, then trained an age estimator on each subset separately. Inspired by this, we first train the CNN by gender, then replace the gender prediction layer with an age prediction layer, and finally fine-tune the whole CNN structure.

III-C Training with weighted loss layer
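The pretraining-by-gender mechanism can be sketched schematically (the dict-based "model" and the helper name are illustrative, not the paper's Torch code): the trunk trained on the 2-class gender task is kept, and only the prediction layer is replaced before fine-tuning for the 8 age groups.

```python
def replace_prediction_layer(model, num_classes):
    """Keep the trained trunk; swap in a freshly initialized prediction layer."""
    new_model = dict(model)                    # trunk weights are reused as-is
    new_model["head"] = {"classes": num_classes, "weights": "random-init"}
    return new_model

# trunk pretrained on the gender task, head predicts 2 classes
gender_model = {"trunk": "weights-from-gender-training",
                "head": {"classes": 2, "weights": "trained"}}
# reuse the trunk, re-initialize the head for the 8 Adience age groups
age_model = replace_prediction_layer(gender_model, 8)
```

The whole structure (trunk plus new head) is then fine-tuned end to end on the age labels.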
There are some differences between general image classification and age estimation. Firstly, the different classes in general image classification are uncorrelated, but age groups have a sequential relationship between labels. These interrelated age groups are more difficult to distinguish. Secondly, human aging processes show variations across different age ranges. For example, the aging processes of midlife adults and children are not equivalent. In this paper, we analyze the law of human aging, and perform age estimation under its guidance. For humans, it is easier to tell which of two people is the older one than to determine the persons' actual ages. Based on this characteristic and the ordered age groups, we define age groups g_k, k = 1, 2, ..., K, where K is the number of age group labels. Then, for a given age group g_k, we separate the data set into two subsets S_k^- and S_k^+ as follows:
S_k^- = {x_i | y_i <= k},  S_k^+ = {x_i | y_i > k},   (2)

where y_i is the age group label of face image x_i.
Next, we use the two subsets to learn a binary classifier that can be considered as a query: "Is the face older than age group g_k?" There are eight classes (0-2, 4-6, 8-13, 15-20, 25-32, 38-43, 48-53, 60+) in the Adience data set, so we can choose k = 1, 2, ..., 7. By doing so, we get seven binary-class data sets, and the results of these binary classifiers form a human aging curve which represents the human aging process. We run experiments on folder 0 of the Adience data set with the 4c2f-CNN described in [30] (just using two classes instead of eight), and the aging curve is shown in Fig. 4. We discover that the 4th, 5th and 6th results are smaller than the others. In conclusion, the aging process in the younger and older age groups is faster than in the intermediate age groups, so it is harder to distinguish the intermediate age groups compared to the younger and older ones.
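The "older than group k" subsets of Eq. (2) can be built in a few lines (a sketch of ours; variable names are illustrative):

```python
def split_older_than(labels, k):
    """Partition sample indices into S_k^- (group <= k) and S_k^+ (group > k)."""
    s_minus = [i for i, y in enumerate(labels) if y <= k]
    s_plus = [i for i, y in enumerate(labels) if y > k]
    return s_minus, s_plus

# labels are 1-based age group indices (1..8 for the Adience groups)
labels = [1, 4, 8, 3, 5]
s_minus, s_plus = split_older_than(labels, k=4)   # query: "older than group 4?"
```

Running this for k = 1, ..., 7 yields the seven binary-class data sets whose classifier accuracies trace out the aging curve.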
Name  Loss Weight Distribution

LW0  (1, 1, 1, 1, 1, 1, 1, 1)
LW1  (1, 1, 1, 0.9, 0.8, 0.8, 0.9, 1)
LW2  (1, 1, 1, 1.1, 1.2, 1.2, 1.1, 1)
LW3  (1, 1, 1, 1.3, 1.5, 1.5, 1.3, 1)
Through the above analysis, we realize that the 4th, 5th, 6th and 7th groups are more difficult to estimate, so we apply higher loss weights to these age groups. Thus, we define four different loss weight distributions to search for optimal results, as shown in Table I.
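A weighted loss layer of this kind amounts to scaling each sample's loss by the weight of its true class. A minimal sketch with the LW3 weights, assuming a plain softmax cross-entropy (our illustrative choice, not necessarily the exact Torch layer used):

```python
import math

LW3 = [1, 1, 1, 1.3, 1.5, 1.5, 1.3, 1]   # one weight per Adience age group

def weighted_cross_entropy(probs, label, weights=LW3):
    """Negative log-likelihood of the true class, scaled by its loss weight."""
    return -weights[label] * math.log(probs[label])

uniform = [1.0 / 8] * 8
# a sample from the hard 5th group (0-indexed 4) costs 1.5x the default loss
loss_mid = weighted_cross_entropy(uniform, 4)
loss_edge = weighted_cross_entropy(uniform, 0)
```

Gradients scale with the loss, so the network is pushed harder to separate the up-weighted intermediate groups.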
III-D Pretraining on ImageNet
Because the data sets used for age and gender estimation are small, overfitting occurs easily during training, so we first train the RoR network on the ImageNet data set to obtain a basic feature expression model. We then fine-tune the pretrained RoR model on the Adience data set, so as to alleviate the overfitting problem caused by training directly on Adience.

The data sets previously used with RoR were all small-scale image data sets; in this paper we first conduct experiments on a large-scale, high-resolution image data set, ImageNet. We evaluate our RoR method on the ImageNet 2012 classification data set [51], which contains 1.28 million high-resolution training images and 50,000 validation images in 1000 object categories. During training of RoR, we notice that RoR is slower than ResNets. So instead of training RoR from scratch, we use the ResNets models from [52] for pretraining. The weights from the pretrained ResNets models remain unchanged, while the newly added weights are initialized as in [53]. In addition, SD is not used here because SD makes RoR difficult to converge on ImageNet. Then we replace the 1000-class prediction layer with an age or gender prediction layer, and fine-tune the whole RoR structure on Adience.
III-E Fine-tuning on IMDB-WIKI-101
In order to make the RoR model further learn the feature expression of facial images and also reduce the overfitting problem, we use the large-scale face image data set IMDB-WIKI-101 [1] to fine-tune the model after pretraining on ImageNet.
IMDB-WIKI is the largest publicly available data set for age estimation of people in the wild, containing more than half a million images with accurate age labels ranging from 0 to 100. The images were crawled from IMDb and Wikipedia: the IMDb part contains 460,723 images of 20,284 celebrities and the Wikipedia part contains 62,328 images. As the images were obtained directly from the websites, the IMDB-WIKI data set contains many low-quality images, such as comic images, sketch images, severely occluded faces, full body images, multi-person images, blank images, and so on. Example images are shown in Fig. 5. These bad images seriously affect the network's learning. Therefore, in this paper, four people spent a week manually removing the low-quality images. In our removal process we mainly consider: a) bad images, which are not standard face images, and b) images with wrong age labels, especially for ages from 0 to 10 years old. After cleaning, 440,607 images remain. The cleaned data set is divided into 101 classes, one for each age, and we name it the IMDB-WIKI-101 data set.
Firstly, we replace the 1000-class ImageNet prediction layer with a 101-class prediction layer for age prediction, and fine-tune the RoR structure on IMDB-WIKI-101. When fine-tuning the RoR model, the IMDB-WIKI-101 data set is randomly divided into 90% for training and 10% for testing. Then we replace the 101-class prediction layer with an age or gender prediction layer, and fine-tune the whole RoR structure on Adience.
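The random 90/10 split can be sketched as follows (a generic helper of ours, not the authors' script):

```python
import random

def train_test_split(items, test_frac=0.1, seed=0):
    """Shuffle once with a fixed seed, then hold out the last test_frac."""
    rng = random.Random(seed)          # fixed seed keeps the split reproducible
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_frac)))
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(list(range(100)))
```

Fixing the seed matters here: the same held-out 10% is used for every fine-tuning run, so the test accuracies stay comparable.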
IV Experiments
In this section, extensive experiments are conducted to demonstrate the effectiveness of the proposed RoR architecture, the two mechanisms, pretraining on ImageNet and further fine-tuning on the IMDB-WIKI-101 data set. The experiments are conducted on the unconstrained age group and gender data set Adience [15]. Firstly, we introduce our experimental implementation. Secondly, we empirically demonstrate the effectiveness of the two mechanisms for age group classification. Thirdly, we analyze different Pre-RoR models for age group and gender classification. Fourthly, we improve the performance of age and gender estimation by pretraining RoR models on ImageNet. Furthermore, the RoR model is fine-tuned on the IMDB-WIKI-101 data set to learn the feature expression of face images. Finally, the results of our best models are compared with several state-of-the-art approaches.
IV-A Implementation
For the Adience data set, we run experiments using the 4c2f-CNN [30], VGG [21], Pre-ResNets [47] and our Pre-RoR architectures, respectively.

4c2f-CNN: The CNN structure described in [30] is used as the baseline for the experiments with the two mechanisms. Compared to the original 4c2f-CNN in [30], our baseline adds data preprocessing by subtracting the mean and dividing by the standard deviation.

VGG: We choose VGG-16 [21] to construct age group and gender classifiers.

Pre-ResNets: We use Pre-ResNets-34, Pre-ResNets-50 and Pre-ResNets-101 [47] as the basic architectures.

Pre-RoR: We use the basic block and bottleneck block Pre-ResNets [47] to construct the RoR architecture. The original Pre-ResNets contain four groups of residual blocks (with 64, 128, 256 and 512 filters), whose feature map sizes are 56, 28, 14 and 7, respectively. Pre-RoR with basic blocks includes Pre-RoR-34 (34 layers), Pre-RoR-58 (58 layers) and Pre-RoR-82 (82 layers). Pre-RoR with bottleneck blocks includes Pre-RoR-50 (50 layers) and Pre-RoR-101 (101 layers). Each residual block group in the different Pre-RoR variants has a different number of residual blocks, as shown in Table II. Pre-RoR contains four middle-level residual blocks (each containing some final-level residual blocks) and one root-level residual block (containing the four middle-level residual blocks). We adopt the BN-ReLU-conv order, as shown in Fig. 2 and Fig. 3.
Block Type  Number of Layers  Number of blocks in each Group 

Basic Block  34  3, 4, 6, 3 
Basic Block  58  5, 6, 12, 5 
Basic Block  82  7, 8, 14, 7 
Bottleneck Block  50  3, 4, 6, 3 
Bottleneck Block  101  3, 4, 23, 3 
Our implementation is based on Torch 7 with one Nvidia GeForce Titan X. We initialize the weights as in [26]. We use SGD with a mini-batch size of 64 for these architectures, except for Pre-RoR with bottleneck blocks, where we use a mini-batch size of 32. The total epoch number is 164. The learning rate starts from 0.1, and is divided by a factor of 10 after epochs 80 and 122. We use a weight decay of 1e-4, a momentum of 0.9, and Nesterov momentum with 0 dampening [52]. For the stochastic depth drop-path method, we set the survival probability p_l of the l-th block with the linear decay rule p_l = 1 - (l/L)(1 - p_L), where p_0 = 1 and p_L = 0.5 [48].

The entire Adience collection includes 26,580 256×256 color facial images of 2,284 subjects, with eight age group classes and two gender classes. Testing for both age and gender classification is performed using the standard five-fold, subject-exclusive cross-validation protocol defined in [15]. We use the in-plane aligned version of the faces, originally used in [54]. For data augmentation, VGG, Pre-ResNets and Pre-RoR use scale and aspect ratio augmentation [52] instead of the scale augmentation used in 4c2f-CNN.
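The linear decay rule for stochastic depth used above can be written out explicitly (a sketch; `num_blocks` plays the role of L):

```python
def survival_probs(num_blocks, p_last=0.5):
    """Linear decay rule of [48]: p_l = 1 - (l / L) * (1 - p_L).
    Early blocks are almost always kept; the last block survives
    with probability p_L."""
    L = num_blocks
    return [1 - (l / L) * (1 - p_last) for l in range(1, L + 1)]

probs = survival_probs(4)   # [0.875, 0.75, 0.625, 0.5]
```

During training, block l is dropped (and bypassed via its shortcut) with probability 1 - p_l for every mini-batch; at test time all blocks are kept and their outputs scaled accordingly.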
IV-B Effectiveness of two mechanisms
In this section, we run age group classification experiments on folder 0 of the Adience data set with the two mechanisms based on the 4c2f-CNN architecture, and the results are shown in Fig. 6. Here, we report the exact accuracy (correct age group predicted) and the 1-off accuracy (correct or adjacent age group predicted), as in [15].
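The two metrics can be computed as follows (a small helper of ours; groups are indexed by position, so "adjacent" means an index difference of one):

```python
def exact_and_one_off(preds, labels):
    """Exact: predicted age group matches the true one.
    1-off: prediction is the true group or an adjacent one."""
    n = len(labels)
    exact = sum(p == y for p, y in zip(preds, labels)) / n
    one_off = sum(abs(p - y) <= 1 for p, y in zip(preds, labels)) / n
    return exact, one_off

exact, one_off = exact_and_one_off([0, 2, 3, 7], [0, 3, 3, 5])
```

Because 1-off forgives confusions between neighboring groups, it is always at least as high as the exact accuracy, which matches the gap seen in all the tables below.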
First, we apply each mechanism to 4c2f-CNN individually. In Fig. 6, 4c2f-CNN pretrained by gender (4c2f-CNN-pt) achieves apparent progress compared to 4c2f-CNN without pretraining. Fig. 6 also shows that 4c2f-CNN with loss weight distribution LW3 (4c2f-CNN-LW3) achieves the best performance among all the loss weight distributions on folder 0 of the Adience data set, so we choose LW3 as the loss weight distribution in the following experiments. Finally, we combine the two mechanisms to predict age groups, and Fig. 6 shows that 4c2f-CNN combining pretraining by gender and loss weight distribution LW3 (4c2f-CNN-pt-LW3) achieves better performance than the other models. These experiments demonstrate the effectiveness of pretraining by gender and the weighted loss layer for promoting the performance of age group classification.
IV-C Age group and gender classification by Pre-RoR
In order to find the optimal Pre-RoR model on the Adience data set, we run extensive comparative experiments with folder 0 validation, and evaluate the effects of SD, dropout, shortcut type, block type, maximum epoch number and depth on age estimation results.
Method  Age Exact Accuracy(%)  Age 1-off(%)  Gender Accuracy(%)

Pre-ResNets-34 (Type B)  58.81  88.31  90.23
Pre-ResNets-34+SD (Type B)  59.56  90.43  89.91
Pre-RoR-34+SD (Type B)  60.21  91.14  90.72
Pre-RoR-34+SD+dropout (Type B)  59.87  88.68  90.32
Pre-RoR-34+SD (Type A+B)  61.56  91.59  90.78
Pre-RoR-34+SD (Type A+B), 300 epochs  61.52  91.56  90.84
Pre-RoR-58+SD (Type A+B)  62.48  92.31  90.85
Pre-RoR-82+SD (Type A+B)  61.78  92.15  90.87
Firstly, basic blocks are used in the experiments, and the results of the different architectures are shown in Table III. We run experiments with Pre-ResNets-34 (34 convolutional layers) with and without SD. Because the Adience data set has only about 26,580 high-resolution images, overfitting is a critical problem. In Table III, the performance of Pre-ResNets-34 with SD is better than without SD, which means SD alleviates the effect of overfitting. We then use Pre-RoR-34+SD to estimate age and gender. Pre-RoR-34+SD outperforms Pre-ResNets-34+SD, because RoR promotes the learning capability of residual networks. To further reduce overfitting, we try dropout between the convolutional layers in the residual blocks, but the result of Pre-RoR-34+SD+dropout shows that the dropout method does not make a big difference in RoR. This is consistent with WRN [55]. Zhang et al. [43] noted that extra parameters would escalate overfitting and that zero-padding (Type A) would bring more deviation, so shortcut Type A should be used in the final level and Type B in the other levels (called Type A+B). Table III shows that Pre-RoR-34+SD with Type A+B performs better than Pre-RoR-34+SD using Type B in all levels. Fig. 7 shows the test errors of Pre-ResNets-34, Pre-ResNets-34+SD and Pre-RoR-34+SD (Type A+B) at different training epochs with folder 0 validation. Zhang et al. [43] proved that a maximum epoch number of 500 is necessary to optimize RoR on CIFAR-10 and CIFAR-100, but the results of Pre-RoR-34+SD with 300 epochs show that a maximum epoch number of 164 is enough for the Adience data set. Generally, ResNets [26] and RoR [43] can improve performance by increasing depth. We estimate age and gender with Pre-RoR-58+SD and Pre-RoR-82+SD. The age estimation result of Pre-RoR-58+SD is better than that of Pre-RoR-34+SD, but Pre-RoR-82+SD is worse than Pre-RoR-58+SD, which is caused by degradation. Gender estimation keeps improving with more layers, since degradation is less critical for binary classification.
Secondly, we use bottleneck blocks instead of basic blocks, and the results of the different architectures are shown in Table IV and Table V. We run experiments with Pre-ResNets-50+SD (Type B, α = 4) and Pre-RoR-50+SD (Type A+B, α = 4). As can be observed, the performance of Pre-RoR-50+SD (Type A+B, α = 4) is worse than Pre-ResNets-50+SD (Type B, α = 4). When we use Type A in the final levels, the input and output planes of these shortcuts are very different, and zero-padding (Type A) brings more deviation. So we reduce the output dimensions by using α = 2 and α = 1. The results of Pre-RoR-50+SD (Type A+B, α = 2) and Pre-RoR-50+SD (Type A+B, α = 1) show that the deviation problem is largely alleviated by reducing dimensions. The performance of Pre-RoR-50+SD (Type A+B, α = 2) is better than Pre-RoR-50+SD (Type A+B, α = 1), because reducing dimensions also reduces parameters and the optimization ability of the network. Pre-RoR-50+SD (Type A+B, α = 2) achieves a balance between the deviation and overfitting problems, but it cannot catch up with Pre-RoR with basic blocks because of these two problems.
Method  Age Exact Acc(%)  Age 1-off(%)  Gender Acc(%)

Pre-ResNets-50+SD (Type B) α=4  60.05  88.98  89.82
Pre-RoR-50+SD (Type A+B) α=4  58.62  90.10  88.71
Pre-RoR-50+SD (Type A+B) α=2  61.68  91.63  88.92
Pre-RoR-50+SD (Type A+B) α=1  61.12  91.14  90.03
We run the same experiments with the depth increased to 101 convolutional layers. As shown in Table V, we find results similar to those of the 50-layer networks in Table IV. Pre-RoR-101+SD (Type A+B, α = 2) achieves the best performance, and also outperforms Pre-RoR-50+SD (Type A+B, α = 2).
Method  Age Exact Acc(%)  Age 1-off(%)  Gender Acc(%)

Pre-ResNets-101+SD (Type B) α=4  59.16  89.61  89.12
Pre-RoR-101+SD (Type A+B) α=4  60.46  90.95  88.37
Pre-RoR-101+SD (Type A+B) α=2  62.26  91.54  89.15
Pre-RoR-101+SD (Type A+B) α=1  60.49  91.14  89.41
In the above experiments, we only used one folder to analyze the different network architectures. Now we demonstrate the generality of our method using the standard five-fold, subject-exclusive cross-validation protocol. In the following experiments, we only use Type A+B for Pre-RoR+SD. The age cross-validation results of Pre-RoR+SD (Type A+B) with different block types and depths are shown in Table VI, where we achieve results similar to folder 0 validation. The performance of Pre-RoR+SD with basic blocks is better than with bottleneck blocks; we attribute this to the deviation caused by zero-padding. Our Pre-RoR-58+SD achieves the best performance, outperforming 4c2f-CNN by 18.8% and 5.7% on the exact and 1-off accuracy of the Adience data set.
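Accuracies in Table VI are reported as mean ± standard deviation over the five folds; for reference, a sketch of that computation (we assume the population standard deviation here; the authors may have used the sample one):

```python
def mean_std(fold_accs):
    """Mean and (population) standard deviation over cross-validation folds."""
    n = len(fold_accs)
    mean = sum(fold_accs) / n
    var = sum((a - mean) ** 2 for a in fold_accs) / n
    return mean, var ** 0.5

# hypothetical per-fold exact accuracies, for illustration only
m, s = mean_std([60.0, 62.0, 64.0, 61.0, 63.0])
```

The standard deviation gives a sense of how much the subject-exclusive folds differ, which is substantial on a data set of only 2,284 subjects.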
Method  Exact Acc(%)  1-off(%)

4c2f-CNN  52.62±4.37  88.61±2.27
VGG-16  54.64±4.76  54.64±4.76
Pre-ResNets-34  60.15±3.99  90.90±1.67
Pre-ResNets-34+SD  60.98±4.21  91.87±1.73
Pre-RoR-50+SD α=2  61.31±4.29  93.45±1.34
Pre-RoR-50+SD α=1  61.00±4.15  93.19±1.67
Pre-RoR-101+SD α=2  61.54±4.97  93.37±1.72
Pre-RoR-101+SD α=1  61.25±4.54  93.52±1.59
Pre-RoR-34+SD  62.35±4.69  93.55±1.90
Pre-RoR-58+SD  62.50±4.33  93.63±1.90
Pre-RoR-82+SD  62.14±4.10  93.68±1.22
IV-D Age group and gender classification by pretraining on ImageNet
Because we could not find well-trained Pre-ResNets on the web, we construct RoR based on the well-trained ResNets from [52] for ImageNet. The well-trained ResNets from [52] use Type B in the residual blocks, so we use Type B in all levels of RoR. We use SGD with a mini-batch size of 128 (18 and 34 layers), 64 (101 layers) or 48 (152 layers) for 10 epochs to fine-tune RoR. The learning rate starts from 0.001 and is divided by a factor of 10 after epoch 5. For data augmentation, we use scale and aspect ratio augmentation [52]. Both Top-1 and Top-5 error rates are evaluated with 10-crop testing. As Table VII shows, our implementation of residual networks achieves the best performance compared to the ResNets methods for single-model evaluation on the validation data set. These experiments verify the effectiveness of RoR on ImageNet.
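The step schedules used throughout (start at 0.001 and divide by 10 after epoch 5 here; start at 0.1 and divide after epochs 80 and 122 for Adience training) all follow the same pattern, sketched below (helper names are ours):

```python
def step_lr(base_lr, epoch, milestones, gamma=0.1):
    """Multiply the base learning rate by gamma once per passed milestone."""
    drops = sum(epoch >= m for m in milestones)
    return base_lr * (gamma ** drops)

lr_warm = step_lr(0.001, 3, [5])    # before the drop
lr_late = step_lr(0.001, 7, [5])    # after epoch 5, reduced tenfold
```

The same helper reproduces the Adience schedule via `step_lr(0.1, epoch, [80, 122])`.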
Method  Top-1 Error  Top-5 Error

ResNets-18 [52]  28.22  9.42
RoR-18  27.84  9.22
ResNets-34 [26]  24.52  7.46
ResNets-34 [52]  24.76  7.35
RoR-34  24.47  7.13
ResNets-101 [26]  21.75  6.05
ResNets-101 [52]  21.08  5.35
RoR-101  20.89  5.24
ResNets-152 [26]  21.43  5.71
ResNets-152 [52]  20.69  5.21
RoR-152  20.55  5.14
When we use the pretrained RoR model to fine-tune on Adience, we replace the 1000-class prediction layer with an age or gender prediction layer. We use SGD with a mini-batch size of 64 for 120 epochs to fine-tune on Adience. The learning rate starts from 0.01 and is divided by a factor of 10 after epoch 80. Based on the analysis in the previous section, a moderately deep Pre-RoR may outperform a very deep one, so we use RoR-34 instead of a deeper RoR as the basic pretrained model. The results of the different methods are shown in Table VIII. We run experiments with ResNets-34 and RoR-34. The results of ResNets-34 and RoR-34 pretrained on ImageNet are better than those of ResNets-34 and RoR-34 trained from scratch, because pretraining on ImageNet reduces the overfitting problem. When we add the SD method to these experiments, the performance is promoted further. In particular, RoR-34+SD pretrained on ImageNet achieves very competitive performance, outperforming Pre-RoR-34+SD. These experiments verify the effectiveness of pretraining on ImageNet for age group and gender classification.
Method  Age Exact Acc(%)  Age 1-off(%)  Gender Acc(%)

ResNets-34  59.39±4.45  91.98±1.57  90.12±1.48
ResNets-34 by pretraining on ImageNet  61.15±4.53  92.90±1.98  91.18±1.53
ResNets-34+SD by pretraining on ImageNet  61.47±5.17  93.39±1.95  91.98±1.49
RoR-34  60.29±4.25  92.44±1.45  91.07±1.64
RoR-34 by pretraining on ImageNet  61.73±4.31  92.97±1.55  91.96±1.53
RoR-34+SD by pretraining on ImageNet  62.34±4.53  93.64±1.47  92.43±1.51
IV-E Age group and gender classification by fine-tuning on IMDB-WIKI-101
As the amount of training data strongly affects the accuracy of the trained models, there is a great need for large data sets. Thus, we use IMDB-WIKI-101 to further fine-tune the RoR model. After pretraining on ImageNet, we fine-tune the RoR model on IMDB-WIKI-101. The number of epochs is set to 120. The learning rate starts from 0.01 and is divided by a factor of 10 after epochs 60 and 90. When we use the fine-tuned RoR model to fine-tune on Adience, we replace the 101-class prediction layer with an age or gender prediction layer. The number of epochs is set to 60, and the learning rate is set to 0.0001.
As shown in Table IX, with fine-tuning on the IMDB-WIKI-101 data set, the performance of both the ResNets-34 and RoR-34 models is significantly improved. This shows that a large data set of face age images yields better performance. RoR-34 fine-tuned on IMDB-WIKI-101 reaches an age exact accuracy of 66.74% (1-off 97.38%), compared to 60.29% (1-off 92.44%) when training directly on the Adience data set. This is competitive performance on the Adience data set for age group and gender classification in the wild.
When we only use the ImageNet data set to pre-train the RoR-34 model, the age estimation results on Adience are better with the stochastic depth algorithm than without it. However, when we first pre-train RoR-34 on ImageNet and then fine-tune it on IMDB-WIKI-101, the results on Adience with stochastic depth are worse than without it. The reason is that ImageNet is an object image data set, from which the network learns general object features, so adding stochastic depth to the original network is still effective. IMDB-WIKI-101, in contrast, is a large-scale face image data set: the RoR-34 network can fully learn the characteristics of face images from it, which already reduces overfitting. Adding stochastic depth changes the effective structure of the network, so it must relearn the facial feature parameters; this is why the results with SD are no better than those without SD.
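The stochastic depth behaviour [48] discussed here can be sketched at the level of a single residual block. This is a scalar toy version of the general idea, not the paper's implementation: during training the residual branch is skipped with probability 1 - p, and at test time it is kept but scaled by p:

```python
import random

def sd_residual_block(x, residual_fn, survival_p, training=True):
    """Toy stochastic-depth residual block (after Huang et al. [48]).
    During training, the residual branch residual_fn is dropped with
    probability 1 - survival_p, leaving only the identity shortcut;
    at test time the branch output is scaled by survival_p instead."""
    if training:
        if random.random() < survival_p:
            return x + residual_fn(x)   # branch survives this pass
        return x                        # branch dropped: identity only
    return x + survival_p * residual_fn(x)
```

With survival_p = 1 the block reduces to an ordinary residual block; lower values effectively shorten the network during training.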
Method  Age Exact Acc (%)  Age 1-off (%)  Gender Acc (%)
ResNets-34 + IMDB-WIKI  66.63±3.04  97.20±0.65  93.17±1.57
RoR-34 + IMDB-WIKI + SD  66.42±2.64  97.35±0.65  92.90±1.76
RoR-34 + IMDB-WIKI  66.74±2.69  97.38±0.65  93.24±1.77
IV-F Comparisons with state-of-the-art results of age group and gender classification on Adience
To begin with, we use the 4c2f-CNN, VGG-16, Pre-ResNets, our RoR+SD pre-trained on ImageNet, and Pre-RoR+SD architectures to estimate gender. In addition, we use the IMDB-WIKI-101 data set to fine-tune ResNets-34 and RoR-34 for gender estimation. The gender cross-validation results of the different methods are shown in Table X. RoR-34+SD achieves a competitive accuracy of 92.43% by pre-training on ImageNet alone, and RoR-34+IMDB-WIKI achieves the best accuracy of 93.24%, outperforming 4c2f-CNN [30] by 6.44%.
Method  Exact Accuracy (%)
SVM-dropout [15]  79.3±0.0
4c2f-CNN [30]  86.8±1.4
4c2f-CNN  87.50±1.56
VGG-16  88.36±1.69
Pre-ResNets-34  92.04±1.51
Pre-RoR-50+SD =2  90.45±1.39
Pre-RoR-50+SD =1  90.66±1.41
Pre-RoR-101+SD =2  91.09±1.44
Pre-RoR-101+SD =1  91.31±1.54
Pre-RoR-34+SD  92.18±1.51
Pre-RoR-58+SD  92.29±1.49
Pre-RoR-82+SD  92.37±1.52
RoR-34+SD by pre-training on ImageNet  92.43±1.51
ResNets-34 + IMDB-WIKI  93.17±1.57
RoR-34 + IMDB-WIKI  93.24±1.77
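Each entry of the form a±b in Tables VIII through XI is the mean and standard deviation of the accuracy over the Adience cross-validation folds. A minimal helper (ours) for producing such entries; whether the paper uses the population or the sample standard deviation is not stated, so the population form is assumed here:

```python
def fold_stats(fold_accs):
    """Mean and population standard deviation of per-fold accuracies,
    matching the 'mean±std' entries reported in the tables."""
    n = len(fold_accs)
    mean = sum(fold_accs) / n
    var = sum((a - mean) ** 2 for a in fold_accs) / n
    return mean, var ** 0.5
```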
Then, we use the 4c2f-CNN, VGG-16, Pre-ResNets, our RoR-34+SD pre-trained on ImageNet, and Pre-RoR-58+SD (Type A+B) architectures with the two mechanisms to estimate age. Furthermore, we use the IMDB-WIKI-101 data set to fine-tune ResNets-34 and RoR-34, and then apply the two mechanisms for further age estimation on Adience. Table XI compares the state-of-the-art methods for age group classification on the Adience data set. We find that accuracy increases when the network is fine-tuned on the large-scale face image data set, and that the two mechanisms further improve each architecture, which demonstrates their versatility across different models. Fig. 8 shows the test errors of Pre-RoR-58+SD with and without the two mechanisms at different training epochs under fold-0 validation. In addition, we notice that RoR-34+IMDB-WIKI with the two mechanisms is slightly better than RoR-34+IMDB-WIKI without them; we argue that this is because the model is already well trained by IMDB-WIKI.
Method  Exact Acc (%)  1-off (%)
SVM-dropout [15]  45.1±2.6  79.5±1.4
4c2f-CNN [30]  50.7±5.1  84.7±2.2
Chained gender-age CNN [31]  54.5  84.1
RSAAF-c2 [35]  53.5  87.9
DEX w/o IMDB-WIKI pretrain [1]  55.6±6.1  89.7±1.8
DEX w/ IMDB-WIKI pretrain [1]  64.0±4.2  96.60±0.90
RESEMD [36]  62.2  94.3
DAPP [38]  62.2  –
RSAAF-c2 (IMDB-WIKI) [39]  67.3  97.0
4c2f-CNN  52.62±4.37  88.61±2.27
4c2f-CNN with two mechanisms  53.96±3.80  90.04±1.54
VGG-16  54.64±4.76  89.93±1.87
VGG-16 with two mechanisms  56.11±5.05  90.66±2.14
Pre-ResNets-34  60.15±3.99  90.90±1.67
Pre-ResNets-34 with two mechanisms  61.89±4.16  93.50±1.33
Pre-RoR-58+SD  62.50±4.33  93.63±1.90
Pre-RoR-58+SD with two mechanisms  64.17±3.81  95.77±1.24
RoR-34+SD by pre-training on ImageNet  62.34±4.53  93.64±1.47
RoR-34+SD by pre-training on ImageNet with two mechanisms  63.76±4.18  94.92±1.42
RoR-34 + IMDB-WIKI  66.74±2.69  97.38±0.65
RoR-34 + IMDB-WIKI with two mechanisms  66.91±2.51  97.49±0.76
RoR-152 + IMDB-WIKI with two mechanisms  67.34±3.56  97.51±0.67
As shown in Table XI, without using the ImageNet and IMDB-WIKI-101 data sets, the accuracy of Pre-RoR-58+SD with the two mechanisms exceeds the 64.0±4.2% of DEX, which was pre-trained on both ImageNet and IMDB-WIKI (523,051 face images) [1]. Although DEX achieves competitive results, it requires the very large IMDB-WIKI data set for pre-training; our method can learn age and gender representations from scratch without IMDB-WIKI and still achieve the best performance. Our VGG-16 with the two mechanisms also outperforms DEX (likewise based on VGG-16) when the latter is pre-trained only on ImageNet without IMDB-WIKI. These results demonstrate that our method improves the optimization ability of the networks and alleviates overfitting on the Adience data set. Moreover, RoR-34+SD with the two mechanisms, pre-trained on ImageNet, achieves an accuracy of 63.76±4.18%, which is very close to the accuracy in [1], so we have reason to believe that better performance can be achieved by pre-training on additional data sets. In particular, our RoR-34+IMDB-WIKI with the two mechanisms obtains a single-model accuracy of 66.91±2.51% and a 1-off accuracy of 97.49±0.76% on Adience. This single-model accuracy is slightly lower than that in [39], because RoR-34 is small compared with the VGG used in [39]. We therefore repeat the experiments with RoR-152+IMDB-WIKI and obtain, to the best of our knowledge, the new state-of-the-art performance: a single-model accuracy of 67.34±3.56%.
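Of the two mechanisms, the weighted loss layer can be sketched as a per-class-weighted cross-entropy. The following toy version is our illustration: the function names and the inverse-frequency weighting are assumptions, not necessarily the paper's exact scheme; the point is that each age group's loss term is scaled so that under-represented groups contribute more:

```python
import math

def weighted_ce(probs, label, class_weights):
    """Weighted cross-entropy for one sample: the usual -log p(label),
    scaled by a per-class weight for the true age group."""
    return -class_weights[label] * math.log(probs[label])

def inverse_freq_weights(counts):
    """Illustrative weights: inversely proportional to class counts,
    normalized so that the weights average to 1."""
    inv = [1.0 / c for c in counts]
    scale = len(inv) / sum(inv)
    return [w * scale for w in inv]
```

With uniform class counts this reduces to the plain softmax loss; rarer age groups receive proportionally larger gradients.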
V Conclusion
This paper proposes a new Residual networks of Residual networks (RoR) architecture for age and gender classification of high-resolution facial images in the wild. Two modest mechanisms, pre-training by gender and training with a weighted loss layer, are used to improve the performance of age estimation. Pre-training on ImageNet is used to alleviate overfitting, while further fine-tuning on IMDB-WIKI-101 serves to learn the features of face images. With RoR or Pre-RoR plus the two mechanisms, we obtain new state-of-the-art performance on the Adience data set for age group and gender classification in the wild. Through empirical studies, this work not only significantly advances age group and gender classification performance, but also explores the application of RoR to large-scale, high-resolution image classification in the future.
Acknowledgment
The authors would like to thank the editor and the anonymous reviewers for their careful reading and valuable remarks.
References

[1] R. Rothe, R. Timofte, and L. Gool, “Deep expectation of real and apparent age from a single image without facial landmarks,” International Journal of Computer Vision, 2016.
 [2] Z. Ma and A. Leijon, “Bayesian estimation of beta mixture models with variational inference,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 33, no. 11, pp. 2160–2173, Nov. 2011.
 [3] Z. Ma, A. Teschendorff, A. Leijon, Y. Qiao, H. Zhang, and J. Guo, “Variational Bayesian matrix factorization for bounded support data,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 37, no. 4, pp. 876–889, Apr. 2015.
 [4] F. Gao and H. Ai, “Age classification on consumer images with Gabor feature and fuzzy LDA method,” in Proc. International Conference on Biometrics, 2009, pp. 132–141.
 [5] S. Yan, M. Liu and T. Huang, “Extracting age information from local spatially flexible patches,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 2008, pp. 737–740.
 [6] G. Guo, G. Mu, Y. Fu, and T. Huang, “Human age estimation using bio-inspired features,” in Proc. CVPR, 2009, pp. 112–119.
 [7] Y. Fu, and T. Huang, “Human age estimation with regression on discriminative aging manifold,” IEEE Transactions on Multimedia, vol. 10, no. 4, pp. 578–584, Apr. 2008.
 [8] G. Guo, Y. Fu, C. Dyer, and T. Huang, “Image-based human age estimation by manifold learning and locally adjusted robust regression,” IEEE Transactions on Image Processing, vol. 17, no. 7, pp. 1178–1188, Jul. 2008.
 [9] G. Guo and G. Mu, “Simultaneous dimensionality reduction and human age estimation via kernel partial least squares regression,” in Proc. CVPR, 2011, pp. 657–664.
 [10] G. Guo and G. Mu, “Joint estimation of age, gender and ethnicity: CCA vs. PLS,” in Proc. IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, 2013, pp. 1–6.
 [11] A. Lanitis, C. Draganova, and C. Christodoulou, “Comparing different classifiers for automatic age estimation,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 34, no. 1, pp. 621–628, Jan. 2004.
 [12] K. Ricanek and T. Tesafaye, “Morph: a longitudinal image database of normal adult age-progression,” in Proc. International Conference on Automatic Face and Gesture Recognition, 2006, pp. 341–345.
 [13] N. Kumar, A. Berg, P. Belhumeur, and S. Nayar, “Attribute and simile classifiers for face verification,” in Proc. ICCV, 2009, pp. 365–372.
 [14] A. Gallagher, and T. Chen, “Understanding images of groups of people,” in Proc. CVPR, 2009, pp. 256–263.
 [15] E. Eidinger, R. Enbar, and T. Hassner, “Age and gender estimation of unfiltered faces,” IEEE Transactions on Information Forensics and Security, vol. 9, no. 12, pp. 2170–2179, Dec. 2014.
 [16] S. Escalera, J. Gonzalez, X. Baro, and P. Pardo, “ChaLearn looking at people 2015 new competitions: Age estimation and cultural event recognition,” in Proc. International Joint Conference on Neural Networks, 2015, pp. 1–8.
 [17] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
 [18] W. Y. Zou, X. Y. Wang, M. Sun, and Y. Lin, “Generic object detection with dense neural patterns and regionlets,” arXiv preprint arXiv:1404.4316, 2014.
 [19] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
 [20] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” arXiv preprint arXiv:1312.6229, 2013.
 [21] K. Simonyan, and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [22] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: hints for thin deep nets,” arXiv preprint arXiv:1412.6550, 2014.
 [23] C. Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply-supervised nets,” in Proc. AISTATS, 2015, pp. 562–570.
 [24] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutional net,” arXiv preprint arXiv:1412.6806, 2014.
 [25] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 1–9.
 [26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
 [27] A. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features offtheshelf: an astounding baseline for recognition,” arXiv preprint arXiv:1403.6382, 2014.
 [28] D. Yi, Z. Lei, and S. Li, “Age estimation by multi-scale convolutional network,” in Proc. ACCV, 2014, pp. 144–158.
 [29] X. Wang, R. Guo, and C. Kambhamettu, “Deeply-learned feature for age estimation,” in Proc. IEEE Winter Conference on Applications of Computer Vision, 2015, pp. 534–541.
 [30] G. Levi, and T. Hassner, “Age and gender classification using convolutional neural networks,” in Proc. CVPR Workshop, 2015, pp. 34–42.
 [31] A. Ekmekji, “Convolutional neural networks for age and gender classification,” Research report, 2016.
 [32] X. Liu, S. Li, M. Kan, et al., “AgeNet: deeply learned regressor and classifier for robust apparent age estimation,” in Proc. ICCV Workshop, 2015, pp. 16–24.
 [33] G. Antipov, M. Baccouche, S. Berrani, et al., “Apparent age estimation from face images combining general and children-specialized deep learning models,” in Proc. CVPR Workshop, 2016, pp. 96–104.
 [34] Z. Huo, X. Zhang, C. Xing, et al., “Deep age distribution learning for apparent age estimation,” in Proc. CVPR Workshop, 2016, pp. 17–24.
 [35] L. Hou, D. Samaras, T. Kurc, Y. Gao and J. Saltz, “Neural networks with smooth adaptive activation functions for regression,” arXiv preprint arXiv:1608.06557, 2016.
 [36] L. Hou, C. P. Yu, and D. Samaras, “Squared Earth Mover’s Distance-based loss for training deep neural networks,” arXiv preprint arXiv:1611.05916, 2016.
 [37] R. Rothe, R. Timofte, and L. Gool, “Some like it hot: visual guidance for preference prediction,” arXiv preprint arXiv:1510.07867, 2015.
 [38] M. Iqbal, M. Shoyaib, B. Ryu, et al., “Directional age-primitive pattern (DAPP) for human age group recognition and age estimation,” IEEE Transactions on Information Forensics and Security, accepted, 2017.
 [39] L. Hou, D. Samaras, T. Kurc, Y. Gao, and J. Saltz, “ConvNets with smooth adaptive activation functions for regression,” in Proc. International Conference on Artificial Intelligence and Statistics, 2017, pp. 430–439.
 [40] D. Han, J. Kim, and J. Kim, “Deep pyramidal residual networks,” in Proc. CVPR, 2017.
 [41] G. Huang, Z. Liu, K. Weinberger, and L. Maaten, “Densely connected convolutional networks,” in Proc. CVPR, 2017.
 [42] Y. Chen, J. Li, H. Xiao, et al., “Dual path networks,” in Proc. CVPR, 2017.
 [43] K. Zhang, M. Sun, T. Han, X. Yuan, L. Guo, and T. Liu, “Residual networks of residual networks: multi-level residual networks,” IEEE Transactions on Circuits and Systems for Video Technology, accepted, 2017.
 [44] Y. Kwon, and N. Lobo, “Age classification from facial images,” in Proc. CVPR, 1994, pp. 762–767.
 [45] A. Gunay, and V. Nabiyev, “Automatic age classification with LBP,” in Proc. International Symposium on Computer and Information Sciences, 2008, pp. 1–4.
 [46] C. Shan, “Learning local features for age estimation on real-life faces,” in Proc. ACM International Workshop on Multimodal Pervasive Video Analysis, 2010, pp. 23–28.
 [47] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” arXiv preprint arXiv:1603.05027, 2016.
 [48] G. Huang, Y. Sun, Z. Liu, and K. Weinberger, “Deep networks with stochastic depth,” arXiv preprint arXiv:1603.09382, 2016.
 [49] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” M.Sc. thesis, Dept. of Comput. Sci., Univ. of Toronto, Toronto, ON, Canada, 2009.
 [50] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in Proc. NIPS Workshop Deep Learning and Unsupervised feature learning., 2011, pp. 1–9.
 [51] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet large scale visual recognition challenge,” arXiv preprint arXiv:1409.0575, 2014.
 [52] S. Gross and M. Wilber, “Training and investigating residual nets,” Facebook AI Research, CA. [Online]. Available: http://torch.ch/blog/2016/02/04/resnets.html, 2016.
 [53] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” arXiv preprint arXiv:1502.01852, 2015.
 [54] T. Hassner, S. Harel, E. Paz, and R. Enbar, “Effective face frontalization in unconstrained images,” in Proc. CVPR., 2015, pp. 4295–4304.
 [55] S. Zagoruyko, and N. Komodakis, “Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016.