Admin (Forum Moderator)


Topic: PhD dissertation titled Learning Semantic Features for Visual Recognition (Sunday, 28 August 2022, 2:08 am)
My brothers in faith, I have brought you a PhD dissertation titled:

Learning Semantic Features for Visual Recognition
by JINGEN LIU
M.S., University of Central Florida, 2008
M.E., Huazhong University of Science and Technology, 2003
B.S., Huazhong University of Science and Technology, 2000

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the School of Electrical Engineering and Computer Science in the College of Engineering and Computer Science at the University of Central Florida, Orlando, Florida.
The contents are as follows:
TABLE OF CONTENTS

LIST OF FIGURES xvii
LIST OF TABLES 1
CHAPTER 1: INTRODUCTION 2
  1.1 Motivations 6
  1.2 Proposed Work and Contributions 13
    1.2.1 Action recognition via maximization of mutual information 14
    1.2.2 Scene recognition using MMI co-clustering 15
    1.2.3 Visual recognition using multiple features 17
    1.2.4 Learning semantic visual vocabularies using diffusion distance 18
  1.3 Organization of the Thesis 19
CHAPTER 2: LITERATURE REVIEW 21
  2.1 Object and Scene Recognition 21
    2.1.1 Geometry-Based Models 22
    2.1.2 Appearance-Based Models 23
  2.2 Action Recognition 31
    2.2.1 Holistic Approaches 32
    2.2.2 Part-Based Approaches 34
  2.3 Semantic Visual Vocabulary 37
CHAPTER 3: LEARNING OPTIMAL NUMBER OF VISUAL WORDS FOR ACTION RECOGNITION 40
  3.1 Introduction 40
  3.2 Bag of Video-Words Model 43
    3.2.1 Feature Detection and Representation 43
    3.2.2 Action Descriptor 45
  3.3 Clustering of Video-Words by MMI 46
    3.3.0.1 Mutual Information 46
    3.3.0.2 MMI Clustering Algorithm 47
  3.4 Spatiotemporal Structural Information 49
  3.5 Experiments and Discussion 51
    3.5.1 Experiments on KTH data set 53
      3.5.1.1 Action recognition using orderless features 53
      3.5.1.2 Classification using spatiotemporal structural information 56
    3.5.2 IXMAS Multiview dataset 59
  3.6 Conclusion 62
CHAPTER 4: LEARNING SEMANTIC VISUAL-WORDS BY CO-CLUSTERING FOR SCENE CLASSIFICATION 63
  4.1 Introduction 63
  4.2 Co-clustering by Maximization of Mutual Information 65
    4.2.1 Co-clustering Algorithm 67
  4.3 Spatial Correlogram Kernel Matching 69
    4.3.1 Spatial correlogram 70
    4.3.2 Spatial Correlogram Kernel 71
  4.4 Experiments 72
    4.4.1 Classification of Fifteen Scene Categories 73
      4.4.1.1 Classification using orderless features 74
      4.4.1.2 Classification using intermediate concepts and their spatial information 78
    4.4.2 Classification of LSCOM Dataset 81
  4.5 Conclusion 84
CHAPTER 5: VISUAL RECOGNITION USING MULTIPLE HETEROGENEOUS FEATURES 85
  5.1 Introduction 85
  5.2 Fiedler Embedding 87
    5.2.1 Fiedler Embedding and LSA 90
  5.3 Object Classification Framework 91
    5.3.1 Constructing Feature Laplacian Matrix 95
    5.3.2 Embedding 97
  5.4 Action Recognition Framework 97
    5.4.1 Feature Extraction and Representation 98
    5.4.2 Construction of the Laplacian Matrix 101
  5.5 Experiments and Discussion 104
    5.5.1 Synthetic Data Set 104
    5.5.2 Caltech Data Set: Object Recognition 106
      5.5.2.1 Qualitative Results 108
      5.5.2.2 Quantitative Results 110
    5.5.3 Weizmann Data Set: Action Recognition 115
  5.6 Conclusion 119
CHAPTER 6: LEARNING SEMANTIC VOCABULARIES USING DIFFUSION DISTANCE 121
  6.1 Introduction 121
  6.2 Diffusion Maps 123
    6.2.1 Diffusion distances in a graph 123
    6.2.2 Diffusion Maps Embedding 125
    6.2.3 Robustness to Noise 127
    6.2.4 Feature Extraction 129
  6.3 Experiments and Discussion 130
    6.3.1 Experiments on KTH data set 134
    6.3.2 Experiments on YouTube data set 137
    6.3.3 Experiments on Scene data set 139
  6.4 Conclusion 139
CHAPTER 7: CONCLUSION AND FUTURE WORK 140
  7.1 Summary of Contributions 140
  7.2 Future Work 142
    7.2.1 Refine the output of information bottleneck 142
    7.2.2 Semi-supervised method 142
    7.2.3 Multi-scale matching 143
    7.2.4 Efficient shape model 143
REFERENCES 157

LIST OF FIGURES

1.1 Example object images selected from the Caltech-6 dataset showing the variation in scale, viewpoint, and illumination. Each row corresponds to one category. 3
1.2 Example scene images selected from the fifteen-scene data set. The number of images contained in each category is shown under the images. 4
1.3 Example actions selected from the KTH dataset. Each column shows two action examples from one category. It has 6 categories with about 600 action videos in total. 5
1.4 Example actions from the Weizmann action data set. It contains 9 actions with about 81 action videos in total. 5
1.5 Four views of five selected action examples from the IXMAS dataset. It has 13 action categories with 5 camera views and about 2,000 video sequences in total. 6
1.6 Example actions from the UCF YouTube action data set. It contains 11 action categories (here, eight categories are listed). Each category contains more than 100 video clips. 7
1.7 The Maximum Response filters (this figure is taken from [87]). They include two anisotropic filters with 3 scales at 6 orientations capturing edges (the dark images of the first three rows) and bars (the light images of the first three rows), and 2 rotationally symmetric filters (a Gaussian and a Laplacian of Gaussian). 9
1.8 Two visual words demonstrating the polysemy and synonymy problems in visual vocabulary learning. 12
1.9 Representation of an image in terms of multiple features. (a) The original image. (b) Interest points (SIFT) representing local features. (c) Contours representing shape features. (d) Segments representing region features. 17
2.1 Demonstration of the SIFT descriptor (this figure is taken from [79]). The left panel shows the gradients of an image patch that is divided into 2×2 subregions. The overlaid circle is the Gaussian window weighting the gradients. These gradients are accumulated into orientation histograms, as shown in the right panel. The length of each arrow represents the sum of the gradient magnitudes in the corresponding direction bin. 28
2.2 Two images having similar color histograms (the images originally appeared in [51]). 29
2.3 Examples showing arbitrary-view action recognition (this figure is from [30]). The fourth and third rows are the observed image sequences and their corresponding silhouettes. The second and first rows are the matched silhouettes and their corresponding 3-D exemplars. 33
2.4 (A) Motion energy images (MEI) and motion history images (MHI) (this figure is taken from [3]); (B) space-time interest points detected by a 3-D Harris corner detector (this figure is taken from [46]); (C) space-time interest points detected by a 1-D Gabor detector in the time direction (this figure is taken from [93]). 34
3.1 Illustration of the procedure of representing an action as a bag of video-words (histogram of video-words). 42
3.2 (a) Classification performance comparison between the initial vocabulary and the optimal vocabulary for different initial vocabulary sizes. (b) Performance comparison between MMI clustering and directly applying the k-means algorithm. MMI clustering reduces the initial dimension of 1,000 to the corresponding number. 52
3.3 (a) Confusion table for classification using the optimal number of VWCs (Nc = 177, average accuracy 91.31%). (b) Confusion table for classification using the VWC correlogram. The number of VWCs is 60, and 3 quantized distances are used (average accuracy 94.15%). 53
3.4 The first row shows examples of six actions. The following two rows respectively demonstrate the distribution of the optimal 20 video-word clusters using our approach and 20 video-words using k-means. We superimpose the 3D interest points from all frames into one image. Different clusters are represented by different color codes. Note that our model is more compact, e.g. see the "waving" and "running" actions (best viewed in color). 56
3.5 Example histograms of the VWCs (Nc = 20) for two selected testing actions from each action category. These demonstrate that actions from the same category have similar VWC distributions, which means each category has some dominating VWCs. 57
3.6 (a) Performance (%) comparison between the original 1,000 video-words and the optimal 189 video-word clusters. (b) Average accuracy (%) using three views for training and a single view for testing. 59
3.7 Recognition performance when four views are used for training and a single view is used for testing. The average accuracy is 82.8%. 60
3.8 Recognition performance when four views are used for training and a single view is used for testing. 61
4.1 An illustration of hierarchical scene understanding. 64
4.2 Workflow of the proposed scene classification framework. 66
4.3 Graphical explanation of MMI co-clustering. The goal of MMI co-clustering is to find one clustering of X and Y that minimizes the distance between the distribution matrices p(x,y) and q(x,y). 66
4.4 An example showing the autocorrelograms of three synthetic images. 73
4.5 Example histograms of intermediate concepts for 2 selected testing images from each scene category. 77
4.6 Confusion table of the best performance for the SCC+BOC model. The average performance is 81.72%. 80
4.7 Example key frames selected from the LSCOM data set. 80
4.8 The AP for the 28 categories. BOV-O and BOV-D represent the BOV models with Nv = 3,000 and Nv = 250 respectively. CC-BOC and pLSA-BOC denote the BOC models created by co-clustering and pLSA. 83
5.1 An illustration of the graph containing multiple entities as nodes: images (red), SIFT descriptors (green), contours (purple), and regions (yellow). The goal of our algorithm is to embed this graph in a k-dimensional space so that semantically related nodes have geometric coordinates that are closer to each other. (Please print in color.) 88
5.2 The figure shows two visual words each from the interest-point, contour, and region vocabularies. (a)-(b) Two words belonging to the interest-point vocabulary. (c)-(d) Two words belonging to the contour vocabulary. (e)-(f) Two words belonging to the region vocabulary. 93
5.3 An illustration of the graph containing multiple entities as nodes: ST features (red), spin-image features (yellow), and action videos (green). The goal of our algorithm is to embed this graph in a k-dimensional space so that similar nodes have geometric coordinates that are closer to each other. 98
5.4 Left: the (α, β) coordinates of a surface point relative to the oriented point O. Right: the spin-image centered at O. 100
5.5 Some 3D (x, y, t) action volumes (the first column) with some of their sampled spin-images (red points are the oriented points). 102
5.6 Clustering of entities in the k-dimensional embedding space. The entities are three image categories D1, D2, and D3, and five feature types T1, T2, T3, T4, and T5. The synthetically generated co-occurrence table between features and images is shown on the left side, while the graph represents the assignments of feature types and image categories to the clusters in the 3-dimensional embedding space. 104
5.7 Qualitative results when the query is a feature and the output is a set of images. (a)-(b) Query: interest point; output: ten nearest images. (c)-(d) Query: contour; output: ten nearest images. (e)-(f) Query: region; output: ten nearest images. 107
5.8 Results of different combinations of the query-result entities. (a) Query: interest point; output: five nearest interest points and contours. (b) Query: contour; output: five nearest interest points and regions. (c) Query: region; output: five nearest interest points, contours, and regions. (d) Query: image; output: five nearest interest points and contours. (e) Query: image; output: five nearest interest points, contours, and regions. (f) Query: image; output: five nearest images. 109
5.9 Summary of the results of different experiments. (a) Comparison of the BOW approach with our method using only interest-point features. (b) Comparison of the BOW approach with our method using interest-point and contour features together. (c) Comparison of the BOW approach with our method using all three features. (d) Comparison of Fiedler embedding with LSA using all three feature types. (e) Comparison of the performance of our framework for different values of the embedding dimension, using only interest-point features. (f) Comparison of the contributions of different features to classification. 111
5.10 Different combinations of query-result used for qualitative verification of the constructed k-dimensional space. Each rectangle represents one entity (e.g. an action video or a video-word (a group of features)). In (a)-(c), the features in blue, which come from one video-word, are used as the query, and the 4 nearest videos (in yellow) in the k-dimensional space are returned. Under each video-word, the category component percentage is also shown (e.g. "wave2: 99%, wave1: 1%" means 99% of the features in this video-word are from the "wave2" action). In (d) and (e), we respectively used ST features and spin-image features as the query and retrieved the nearest features in the k-dimensional space. In (f) and (g), two action videos are used as the query, and the nearest features are returned. 116
5.11 Comparison of the BOW approach with our weighted BOW method. 117
5.12 (a) Confusion table for Fiedler embedding with k = 20. (b) Confusion table for LSA with k = 25. 118
5.13 The effect of varying the embedding dimension on performance. All experiments are carried out with Nsi = Nip = 1,000. 118
5.14 Contributions of different features to classification. (a) Nsi = Nip = 200; k = 20, 30, and 20 for ST features, spin-image features, and the combination respectively. (b) Nsi = Nip = 1,000; k = 20, 70, and 20 for ST features, spin-image features, and the combination respectively. 119
6.1 Flowchart of learning a semantic visual vocabulary. 123
6.2 Demonstration of robustness to noise. (a) Two-dimensional spiral points. (b-c) Distributions of the diffusion distance and geodesic distance between points A and B. (d) KTH data set. (e-f) Distributions of the diffusion distance and geodesic distance between two points on the KTH data set. 128
6.3 (a) and (b) show the influence of the diffusion time and sigma value, respectively, on recognition performance. The three curves correspond to three visual vocabularies of size 100, 200, and 300 respectively. The sigma value is 3 in (a) and the diffusion time is 5 in (b). (c) Comparison of the recognition rate between mid-level and high-level features. 131
6.4 (a) Performance comparison between different manifold learning schemes. (b) Performance comparison between DM and IB. 132
6.5 (a) Confusion table of the KTH data set when the size of the semantic visual vocabulary is 100; the average accuracy is 92.3%. (b) Performance comparison between DM and other manifold learning schemes on the YouTube action data set. (c) Confusion table of the YouTube data set when the size of the semantic visual vocabulary is 250; the average accuracy is 76.1%. 133
6.6 The decay of the eigenvalues of P^t on the YouTube data set when sigma is 14. 134
6.7 Some examples of mid-level and high-level features with their corresponding real image patches. Each row lists one mid-level or high-level feature followed by its image patches. The three mid-level features are selected from 40 mid-level features. The four high-level features are selected from 40 high-level features generated by DM from 1,000 mid-level features. 135

LIST OF TABLES

3.1 Major steps for the training phase of our framework. 43
3.2 The number of training examples vs. the average performance. 55
3.3 Performance comparison between different models. VW and VWC respectively denote the video-words and video-word-clusters based methods, and VW Correl and VWC Correl are their corresponding correlogram models. STPM denotes the spatiotemporal pyramid matching approach. The dimension denotes the number of VWs and VWCs. 58
3.4 Performance of the different bag-of-video-words related approaches. pLSA ISM is the major contribution of [111]. 59
4.1 The average accuracy (%) achieved using strong and weak classifiers. 75
4.2 The results achieved under different sampling spaces. 76
4.3 The average accuracy (%) achieved using strong and weak classifiers. 76
4.4 The performance (average accuracy, %) of SPM using visual-words and intermediate concepts. SPM IC and SPM V denote SPM using intermediate concepts and visual-words respectively. 79
4.5 The average classification accuracy (%) obtained by various models (SCC, BOC, and SCC+BOC). 79
4.6 The MAP for the 28 LSCOM categories achieved by different approaches. BOV-O and BOV-D represent the BOV models with Nv = 3,000 and Nv = 250 respectively. CC-BOC and pLSA-BOC denote the BOC models created by co-clustering and pLSA respectively. 82
5.1 Main steps of the action recognition framework. 99
6.1 Procedure of diffusion maps embedding. 127
6.2 Performance comparison between two vocabularies learnt from mid-level features with and without DM embedding. 137
6.3 Performance comparison between two different mid-level feature representations: PMI vs. frequency. 137
6.4 Best results of different manifold learning techniques. 137
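For anyone skimming the contents above: Chapter 3 is built on the bag-of-video-words representation (Figure 3.1), where local spatiotemporal descriptors are quantized into a visual vocabulary and each video becomes a histogram over that vocabulary. The sketch below is only an illustration of that general idea, not the author's code; the descriptor dimension, vocabulary size, and function names are my own assumptions.

```python
# A minimal bag-of-(video-)words sketch: k-means vocabulary + histogram.
# All sizes and names here are illustrative, not from the dissertation.
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors: np.ndarray, num_words: int = 1000) -> KMeans:
    """Cluster local feature descriptors (one row each) into visual words."""
    kmeans = KMeans(n_clusters=num_words, n_init=10, random_state=0)
    kmeans.fit(descriptors)
    return kmeans

def bag_of_words_histogram(video_descriptors: np.ndarray, vocab: KMeans) -> np.ndarray:
    """Represent one video as a normalized histogram of visual-word counts."""
    words = vocab.predict(video_descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Example with random stand-in descriptors (real ones would come from a
# spatiotemporal interest-point detector, as in Section 3.2.1):
rng = np.random.default_rng(0)
all_descriptors = rng.normal(size=(5000, 128))   # pooled training descriptors
vocab = build_vocabulary(all_descriptors, num_words=100)
one_video = rng.normal(size=(80, 128))           # descriptors from one clip
print(bag_of_words_histogram(one_video, vocab))
```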
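Chapter 5 uses Fiedler embedding to place heterogeneous entities (images, interest points, contours, regions) as nodes of one graph and embed them in a common k-dimensional space (Figures 5.1 and 5.3). As a rough sketch of the standard construction, not the dissertation's implementation, the coordinates come from the low eigenvectors of the graph Laplacian; the toy weight matrix below is made up.

```python
# Fiedler-style spectral embedding of a small weighted graph (sketch only).
import numpy as np

def fiedler_embedding(W: np.ndarray, k: int = 2) -> np.ndarray:
    """Embed graph nodes with symmetric weight matrix W into k dimensions
    using the eigenvectors of the graph Laplacian with the smallest
    nonzero eigenvalues (the first of these is the Fiedler vector)."""
    D = np.diag(W.sum(axis=1))
    L = D - W                        # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    return vecs[:, 1:k + 1]          # drop the constant eigenvector

# Toy co-occurrence-style weights between 4 "entities":
W = np.array([[0, 3, 1, 0],
              [3, 0, 1, 0],
              [1, 1, 0, 2],
              [0, 0, 2, 0]], dtype=float)
print(fiedler_embedding(W, k=2))
```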
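Finally, Chapter 6 (Table 6.1) learns semantic vocabularies with diffusion maps. Again as a hedged sketch of the standard procedure rather than the thesis code: build a Gaussian affinity between features, normalize it into a Markov matrix, and use its leading nontrivial eigenvectors, scaled by eigenvalue^t, as the embedding. The sigma = 3 and diffusion time t = 5 defaults mirror the values mentioned in the caption of Figure 6.3; everything else is assumed.

```python
# A minimal diffusion-maps embedding sketch (standard construction).
import numpy as np

def diffusion_map(points: np.ndarray, sigma: float = 3.0, t: int = 5, dim: int = 2) -> np.ndarray:
    # Pairwise squared distances and Gaussian kernel affinity.
    sq = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    P = W / W.sum(axis=1, keepdims=True)       # row-stochastic Markov matrix
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)             # sort eigenvalues descending
    vals, vecs = vals.real[order], vecs.real[:, order]
    # Skip the trivial first eigenvector; scale by eigenvalue^t (diffusion time).
    return (vals[1:dim + 1] ** t) * vecs[:, 1:dim + 1]

embedded = diffusion_map(np.random.default_rng(1).normal(size=(50, 16)))
print(embedded.shape)  # (50, 2)
```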
The Unzip Password: books-world.net

I hope you benefit from the content of this topic and that it earns your approval.

Link from the Books World site to download the PhD dissertation Learning Semantic Features for Visual Recognition
Direct link to download the PhD dissertation Learning Semantic Features for Visual Recognition