{"id":359,"date":"2016-07-06T18:07:00","date_gmt":"2016-07-06T09:07:00","guid":{"rendered":""},"modified":"2022-03-15T14:29:00","modified_gmt":"2022-03-15T05:29:00","slug":"nltknltkcleanhtmlnotimplementederror-to","status":"publish","type":"post","link":"https:\/\/www.sigmadesign.co.jp\/minomonchan\/2016\/07\/nltknltkcleanhtmlnotimplementederror-to.html","title":{"rendered":"How to fix the error \"NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function\" raised by NLTK's nltk.clean_html()"},"content":{"rendered":"<div data-post-id=\"1297\" class=\"insert-page insert-page-1297 \"><div class=\"st-kaiwa-box kaiwaicon3 clearfix\"><div class=\"st-kaiwa-face\"><img decoding=\"async\" src=\"http:\/\/www.sigmadesign.co.jp\/minomonchan\/wp-content\/uploads\/sites\/2\/2022\/03\/profile.jpg\" width=\"60px\"><div class=\"st-kaiwa-face-name\">\u6587\u5b57\u5b9f<\/div><\/div><div class=\"st-kaiwa-area\"><div class=\"st-kaiwa-hukidashi\">This article was written by <a href=\"https:\/\/www.sigmadesign.co.jp\/president\/\" target=\"_blank\" rel=\"noopener\">\u6587\u5b57\u5b9f<\/a>, president of Sigma Design Co., Ltd.<\/div><\/div><\/div>\n<\/div>\n<p>When I went back to NLTK after a long break to research a competitor site, the line that calls nltk.clean_html() raised the following error.<\/p>\n<div style=\"background-color: lightgray; padding: 10px;\">\n<p>NotImplementedError: To remove HTML markup, use BeautifulSoup&#8217;s get_text() 
function<\/p>\n<\/div>\n<p>After some digging, it seems that nltk.clean_html() and nltk.clean_url() can no longer be used as of NLTK version 3.<\/p>\n<p>Apparently the idea is to use BeautifulSoup&#8217;s get_text() instead.<\/p>\n<p>However, BeautifulSoup&#8217;s get_text() cannot remove JavaScript tags.<\/p>\n<p>So you have to add your own code to strip out the JavaScript.<\/p>\n<p>I searched and found a solution.<\/p>\n<p>I added the following clean_html function to my own code.<\/p>\n<p>Reference: <a href=\"http:\/\/stackoverflow.com\/questions\/26002076\/python-nltk-clean-html-not-implemented\" target=\"_blank\" rel=\"noopener\">Python nltk.clean_html not implemented &#8211; Stack Overflow<\/a><\/p>\n<div style=\"background-color: lightgray; padding: 10px;\">\n<pre><code>\r\nimport re\r\n\r\n\r\ndef clean_html(html):\r\n    \"\"\"\r\n    Copied from NLTK package.\r\n    Remove HTML markup from the given string.\r\n\r\n    :param html: the HTML string to be cleaned\r\n    :type html: str\r\n    :rtype: str\r\n    \"\"\"\r\n\r\n    # First we remove inline JavaScript\/CSS:\r\n    cleaned = re.sub(r\"(?is)&lt;(script|style).*?&gt;.*?(&lt;\/\\1&gt;)\", \"\", html.strip())\r\n    # Then we remove html comments. 
This has to be done before removing regular\r\n    # tags since comments can contain '&gt;' characters.\r\n    cleaned = re.sub(r\"(?s)&lt;!--(.*?)--&gt;[\\n]?\", \"\", cleaned)\r\n    # Next we can remove the remaining tags:\r\n    cleaned = re.sub(r\"(?s)&lt;.*?&gt;\", \" \", cleaned)\r\n    # Finally, we deal with whitespace\r\n    cleaned = re.sub(r\"&amp;nbsp;\", \" \", cleaned)\r\n    cleaned = re.sub(r\"  \", \" \", cleaned)\r\n    cleaned = re.sub(r\"  \", \" \", cleaned)\r\n    return cleaned.strip()<\/code><\/pre>\n<\/div>\n<p>Judging from the comments on the page I referenced, this appears to be the old clean_html function brought over as-is.<\/p>\n<p>I expected it to be doing something more complicated, so I was surprised that it gets the job done in just a few lines.<\/p>\n<p>Still, what a hassle. I wonder why it was removed.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>\u6587\u5b57\u5b9f This article was written by \u6587\u5b57\u5b9f, president of Sigma Design Co., Ltd.<\/p>\n","protected":false},"author":1,"featured_media":598,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[75],"tags":[33,55],"class_list":["post-359","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technology","tag-python","tag-55"],"_links":{"self":[{"href":"https:\/\/www.sigmadesign.co.jp\/minomonchan\/wp-json\/wp\/
v2\/posts\/359","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.sigmadesign.co.jp\/minomonchan\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.sigmadesign.co.jp\/minomonchan\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.sigmadesign.co.jp\/minomonchan\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.sigmadesign.co.jp\/minomonchan\/wp-json\/wp\/v2\/comments?post=359"}],"version-history":[{"count":0,"href":"https:\/\/www.sigmadesign.co.jp\/minomonchan\/wp-json\/wp\/v2\/posts\/359\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.sigmadesign.co.jp\/minomonchan\/wp-json\/wp\/v2\/media\/598"}],"wp:attachment":[{"href":"https:\/\/www.sigmadesign.co.jp\/minomonchan\/wp-json\/wp\/v2\/media?parent=359"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.sigmadesign.co.jp\/minomonchan\/wp-json\/wp\/v2\/categories?post=359"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.sigmadesign.co.jp\/minomonchan\/wp-json\/wp\/v2\/tags?post=359"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}