[SOLVED] Preserve &nbsp; in Beautiful Soup object

Issue

I have my sample.htm file as follows:

<html><head>
<title>hello</title>
</head>
<body>
<p>&nbsp; Hello! he said. &nbsp; !</p>
</body>
</html>

I have my python code as follows:

from bs4 import BeautifulSoup

with open('sample.htm', 'r', encoding='utf8') as f:
    contents = f.read()
    soup = BeautifulSoup(contents, 'html.parser')

with open("sample-output.htm", "w", encoding='utf-8') as file:
    file.write(str(soup))

This reads sample.htm and writes the parsed document to another file, sample-output.htm.

The output of the above:

<html><head>
<title>hello</title>
</head>
<body>
<p>  Hello! he said.   !</p>
</body>
</html>

How can I preserve the &nbsp; entities when writing to the file?

Solution

See the "Output formatters" section of the Beautiful Soup documentation:

If you give Beautiful Soup a document that contains HTML entities like
“&ldquo;”, they’ll be converted to Unicode characters.

If you then convert the document to a string, the Unicode characters will
be encoded as UTF-8. You won’t get the HTML entities back.

You can change this behavior by providing a value for the
formatter argument to prettify(), encode(), or decode().

If you pass in formatter="html", Beautiful Soup will
convert Unicode characters to HTML entities whenever possible:

soup_string = soup.prettify(formatter="html")
print(soup_string)
<html>
 <head>
  <title>
   hello
  </title>
 </head>
 <body>
  <p>
   &nbsp; Hello! he said. &nbsp; !
  </p>
 </body>
</html>
print(type(soup_string)) # for the sake of completeness
<class 'str'>

Another way (no "prettify"):

print(soup.encode(formatter="html").decode())
<html><head>
<title>hello</title>
</head>
<body>
<p>&nbsp; Hello! he said. &nbsp; !</p>
</body>
</html>
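
To connect this back to the script in the question, here is a minimal sketch that compares the default serialization with formatter="html". The helper name serialize_with_entities and the inline sample string are assumptions made for illustration; they do not appear in the original post.

```python
from bs4 import BeautifulSoup


def serialize_with_entities(html_text: str) -> str:
    """Parse html_text and serialize it with formatter="html".

    Hypothetical helper for illustration only, not part of Beautiful Soup.
    """
    soup = BeautifulSoup(html_text, "html.parser")
    # formatter="html" turns Unicode characters such as U+00A0 back into
    # named entities such as &nbsp; wherever a named entity exists.
    return soup.decode(formatter="html")


sample = "<p>&nbsp; Hello! he said. &nbsp; !</p>"

# Default serialization: &nbsp; has become the literal character U+00A0.
default_output = str(BeautifulSoup(sample, "html.parser"))

# Serialization with the "html" formatter: the entities are restored.
entity_output = serialize_with_entities(sample)

print("\xa0" in default_output)
print("&nbsp;" in entity_output)
```

In the question's script, this would amount to passing soup.decode(formatter="html") to file.write() instead of str(soup). Note that formatter="html" converts every character that has a named entity, not only non-breaking spaces.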

Answered By – JosefZ

Answer Checked By – Katrina (BugsFixing Volunteer)
