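For my project, I am trying to read data from Wikipedia, and I am not entirely sure how to go about it.
I am mainly interested in reading the event, date, location, and subject. To start with, I have begun...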
One possible solution is to read the web page with webread and process the data with functions from Text Analytics Toolbox:
% Read HTML data.
raw = webread('https://en.wikipedia.org/w/api.php?action=parse&format=json&prop=text&page=91st_Academy_Awards');

% Specify sections of interest.
SectionsOfInterest = ["Date","Site","Preshow hosts","Produced by","Directed by"];

% Parse HTML data.
myTree = htmlTree(raw.parse.text.x_);

% Find table element.
tableElements = findElement(myTree,'Table');
tableOfInterest = tableElements(1);

% Find header cell elements.
thElements = findElement(tableOfInterest,"th");

% Find cell elements.
tdElements = findElement(tableOfInterest,"td");

% Extract text.
thHTML = thElements.extractHTMLText;
tdHTML = tdElements.extractHTMLText;

for section = 1:numel(SectionsOfInterest)
    sectionName = SectionsOfInterest(section);
    sectIndex = strcmp(sectionName,thHTML);

    % Remove spaces if present from section name.
    sectionName = strrep(sectionName,' ','');

    % Clean up data.
    sectData = regexprep(tdHTML(sectIndex),'\n+','.');

    % Create structure.
    s.(sectionName) = sectData;
end
Display the output structure:
>> s

s =

  struct with fields:

            Date: "February 24, 2019"
            Site: "Dolby Theatre.Hollywood, Los Angeles, California, U.S."
    Preshowhosts: "Ashley Graham.Maria Menounos.Elaine Welteroth.Billy Porter.Ryan Seacrest. "
      Producedby: "Donna Gigliotti.Glenn Weiss"
      Directedby: "Glenn Weiss"
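The code above applies an index computed on the th cells to the td cells, so it relies on the two lists lining up one to one. If a table has rows with only a header cell or only a data cell, one possible variation is to walk the table row by row and pair each th with the td in the same tr. The following is only a minimal sketch under that assumption: it uses a small inline HTML string (made up for illustration, not the real Wikipedia markup) so it runs without a network connection, and the variable names are hypothetical. It uses matlab.lang.makeValidName rather than strrep to build field names, which is a swapped-in choice and may capitalize names differently from the output shown above.

```matlab
% Illustrative HTML fragment (an assumption, not the actual page content).
html = "<table>" + ...
       "<tr><th>Date</th><td>February 24, 2019</td></tr>" + ...
       "<tr><th>Site</th><td>Dolby Theatre</td></tr>" + ...
       "<tr><th>Header-only row</th></tr>" + ...
       "</table>";

% Parse the fragment and collect the table rows.
tree = htmlTree(html);
rows = findElement(tree,"tr");

s = struct();
for k = 1:numel(rows)
    % Look for header and data cells within this row only.
    th = findElement(rows(k),"th");
    td = findElement(rows(k),"td");

    % Keep only rows that have exactly one header and one data cell.
    if numel(th) == 1 && numel(td) == 1
        % Turn the header text into a valid struct field name.
        fieldName = matlab.lang.makeValidName(extractHTMLText(th));
        s.(fieldName) = extractHTMLText(td);
    end
end

disp(s)
```

To try this on the real page, replace the inline string with raw.parse.text.x_ from the webread call and restrict rows to those inside the first table, for example with findElement(tableOfInterest,"tr").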